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Abstract 



Gaussian processes (GP) are powerful tools for probabilistic modeling purposes. They 
can be used to define prior distributions over latent functions in hierarchical Bayesian 
models. The prior over functions is defined implicitly by the mean and covariance function, 
which determine the smoothness and variability of the function. The inference can then 
be conducted directly in the function space by evaluating or approximating the posterior 
process. Despite their attractive theoretical properties GPs provide practical challenges 
in their implementation. GPstuff is a versatile collection of computational tools for GP 
models compatible with Linux and Windows MATLAB and Octave. It includes, among 
others, various inference methods, sparse approximations and tools for model assessment. 
In this work, we review these tools and demonstrate the use of GPstuff in several models. 
Keywords: Gaussian process, Bayesian hierarchical model, nonparametric Bayes 
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1. Introduction 



This work reviews a free open source toolbox GPstuff (version 4.1) which implements a 
collection of inference methods and tools for inference for various Gaussian process (GP) 



models. The toolbox is compatible with Unix and Windows MATLAB ( Mathworks 2010| ) 



(r2009b or later, earlier versions may work, but has not been tested extensively) and Octave 
( |Octave community 2012) (3.6.4 or later, compact support covariance functions are not 
currently working in Octave) . It is available from |http : / /bees . aalto . f i/ en/ research/| 
bayes /gpstuff/j 

GP is a stochastic process, which provides a powerful tool for probabilistic inference 



directly on distributions over functions (e.g. O'Hagan 1978) and which has gained much 
attention in recent years (Rasmussen and Williams, 2006). In many practical GP models, 
the observations y = [yi, y n ] 1 related to inputs (covariates) X = {x^ = [x^i, £i,<2] T }™=i 
are assumed to be conditionally independent given a latent function (or predictor) /(x) so 
that the likelihood p(y | f ) = rE=iP(y«l/i)> where f = [/(xi), /(x n )] T , factorizes over 
cases. GP prior is set for the latent function after which the posterior p(f\ y, X) is solved 
and used for prediction. GP defines the prior over the latent function implicitly through the 
mean and covariance function, which determine, for example, the smoothness and variability 
of the latent function. Usually, the model hierarchy is extended to the third level by giving 
also prior for the parameters of the covariance function and the observation model. 

Most of the models in GPstuff follow the above definition and can be summarized as: 



Observation model: 

GP prior: 
hyperprior: 



y |f,<£~ JJp(yi|/i,' 



8=1 



f(x)\6~g-p (m(x),fc(x,x'|0)) 



(1) 

(2) 
(3) 



Here 6 denotes the parameters in the covariance function k(pc, x' \6), and (ft the parameters in 
the observation model. We will call the function value /(x) at fixed x a latent variable and for 
brevity we sometimes denote all the covariance function and observation model parameters 
with i? = [8,(f>]- For the simplicity of presentation the mean function is considered zero, 
m(x) = 0, throughout this paper. We will also denote both the observation model and 
likelihood by p(y | f , 4>) and assume the reader is able to distinguish between these two from 
the context. The likelihood naming convention is used in the toolbox for both likelihood 
and observation model related functionalities which follows the naming convention used, for 
example, in INLA (Rue et al. 2009) and GPML (Rasmussen and Nickisch 2010) software 
packages. There are also models with multiple latent processes and likelihoods where each 
factor depens on multiple latent values. These are discussed in Section [8] 

Many GP models such as general dynamic linear models and multi-output models, do 
not fulfill the assumptions Q-Q. However, the current implementation of GPstuff relies 
on them and as will be seen, the class of models fulfilling these assumptions is rather rich. 
Even though the observations are assumed to be conditionally independent given the latent 
variables, the dependencies between the observations are incorporated into the model via the 
GP prior. At the moment, only independent priors on and (ft are provided, as highlighted 
by equation but general form p(9, (f>) could be used as well. 

Early examples of GP models can be found, for example, in time series analysis and 
filtering (Wiener 1949) and in geostatistics (e.g. Matheron 1973). GPs are still widely used 



in these fields and useful overviews are provided in ( 



Cressie 



2001 Diggle and Ribeiro 2007 Gelfand et al. 2010). 



1993 



Grewal and Andrews 



O'Hagan (1978) was one of the firsts 



to consider GPs in a general probabilistic modeling context and provided a general theory of 
GP prediction. This general regression framework was later rediscovered as an alternative 



for neural network models (Williams and Rasmussen 



1996 



Rasmussen 1996) and extended 



for other problems than regression (Neal 1997, Williams and Barber |1998 ). This machine 



learning perspective is comprehensively summarized by Rasmussen and Williams (2006). 
Nowadays, GPs are used, for example, in weather forecasting (Fuentes and Raftery 



2005 Berrocal et al. 2010), spatial statistics (Best et al. 



Banerjee et al. 2008), computational biology (Honkela et al. 



(Deisenroth et al. 2009 2011), healthcare applications (Stegle et a 



2010), industrial applications (Kalliomaki et al. 



2005 



emulation (Kennedy and O'Hagan 


2001 


Conti et al. 


2009 


, prior elicitation (Moala and 


O'Hagan 


2010 


) and density estimation ( 


Tokdar and Ghosh 


2007 


Tokdar et al. 


2010 


) to 



2005 



2011 



Kaufman et al. 2008 



, reinforcement learning 



2008 Vanhatalo et al. 



computer model calibration and 



name a few. Despite their attractive theoretical properties and wide range of applications, 
GPs provide practical challenges in implementation. 

GPstuff provides several state-of-the-art inference algorithms and approximative meth- 
ods that make the inference easier and faster for many practically interesting models. GP- 
stuff is a modular toolbox which combines inference algorithms and model structures in an 
easily extensible format. It also provides various tools for model checking and comparison. 
These are essential in making model assessment and criticism an integral part of the data 
analysis. Many algorithms and models in GPstuff are proposed by others but reimplemented 
for GPstuff. In each case, we provide reference to the original work but the implementation 
of the algorithm in GPstuff is always unique. The basic implementation of two important 
algorithms, the Laplace approximation and expectation propagation algorithm (discussed 



in Section [3| , follow the pseudocode from ( Rasmussen and Williams 



are later generalized and improved as described in (Vanhatalo et al. 



2006). However, these 



2009 



2010 Vanhatalo 



and Vehtari, 2010 Jylanki et al. 2011). 



There are also many other toolboxes for GP modelling than GPstuff freely available. 
Perhaps the best known packages are nowadays the Gaussian processes for Machine Learning 
GPML toolbox (Rasmussen and Nickisch, 2010) and the flexible Bayesian modelling (FBM) 
toolbox by Radford Neal. A good overview of other packages can be obtained from the 
Gaussian processes website |ht t p : // www . gaus s i anpr o c e s s . org/ and the R Archive Network 



http://cran.r-project.org/ Other GP softwares have some overlap with GPstuff and 
some of them include models that are not implemented in GPstuff. The main advantages 
of GPstuff over other GP software are its versatile collection of models and computational 
tools as well as modularity which allows easy extensions. A comparison between GPstuff, 
GPML and FBM is provided in the Appendix [A] 

Three earlier GP and Bayesian modelling packages have influenced our work. Structure 
of GPstuff is mostly in debt to the Netlab toolbox (Nabney 2001 ), although it is far from be- 
ing compatible. The GPstuff project was started in 2006 based on the MCMCStuff-toolbox 
(1998-2006) ( http : //bees . aalto . f i/en/research/bayes/mcmcstuf f/\ . MCMCStuff for 
its part was based on Netlab and it was also influenced by the FBM. The INLA software 
package by Rue et al. (2009) has also motivated some of the technical details in the tool- 
box. In addition to these, some technical implementations of GPstuff rely on the sparse 
matrix toolbox SuiteSparse (Davis, 2005) ( |http : //www . cise . uf 1 . edu/ research/ sparse/ 
SuiteSparse/). 

This work concentrates on discussing the essential theory behind the implementation 
of GPstuff. We explain important parts of the code, but the full code demonstrating the 
important features of the package (including also data creation and such), are included in 
the demonstration files demo_* to which we refer in the text. 



2. Gaussian process models 
2.1 Gaussian process prior 

GP prior over function /(x) implies that any set of function values f, indexed by the input 
coordinates X, have a multivariate Gaussian distribution 

p(f |X,0) = N(f|O,K f)f ), (4) 

where Kf f is the covariance matrix. Notice, that the distribution over functions will 
be denoted by QV(-,-), whereas the distribution over a finite set of latent variables will 
be denoted by N(-,-). The covariance matrix is constructed by a covariance function, 
[Kf f]i j = fc(xj,Xj|0), which characterizes the correlation between different points in the 
process. Covariance function can be chosen freely as long as the covariance matrices pro- 
duced are symmetric and positive semi-definite (v T Kf j f v > 0, Vv G K n ). An example of a 
stationary covariance function is the squared exponential 

fese(xi,Xj \9) = cr s 2 c exp(-r 2 /2), (5) 

where r 2 = Ylt=i( x i,k ~ x j,k) 2 ' /tf. and 6 = [cr 2 e , l\, Id]. Here, <j 2 e is the scaling parameter, 
and Ik is the length-scale, which governs how fast the correlation decreases as the distance 



increases in the direction k. Other common covariance functions are discussed, for example, 



by Diggle and Ribeiro (2007), Finkenstadt et al. (2007) and Rasmussen and Williams (2006) 



and the covariance functions in GPstuff are summarized in the appendix [B] 

Assume that we want to predict the values f at new input locations X. The joint prior 
for latent variables f and f is 



|X,X,0~ N 0, 



K f)f 
K f,f 



K f,f 



(6) 



where K^f = A;(X, X|#), K f | = fc(X,X|#) and Kfj = fc(X, X|#). Here, the covariance 
function k(-, •) denotes also vector and matrix valued functions /c(x, X) : K d x K rfxn — > 3ft lxn , 
and fc(X,X) : 5ft dxn x 3f? dx " 3f? nxn By definition of GP the marginal distribution of f is 
p(f|X, 9) = N(f|0,Kjj;) similar to Q. The conditional distribution of f given f is 



f | f , X, X, 9 ~ N(Kj f Kjjf f , K f • 



Kf,f K" fif K f i), 



(7) 



where the mean and covariance of the conditional distribution are functions of input vector 
x and X serves as a fixed parameter. Thus, the above distribution generalizes to GP 
with mean function m(x.\9) = fc(x, X|#)Kr f f and covariance fc(x, x'|0) = k(~k,x.'\9) — 
k(5t, X|#) Kr f A;(X, x'|0), which define the conditional distribution of the latent function 
/(*)• 



2.2 Conditioning on the observations 

The cornerstone of the Bayesian inference is Bayes' theorem by which the conditional prob- 
ability of the latent function and parameters after observing the data can be solved. This 
posterior distribution contains all information about the latent function and parameters 
conveyed from the data T> = {X, y} by the model. Most of the time we cannot solve the 
posterior but need to approximate it. GPstuff is built so that the first inference step is to 
form (either analytically or approximately) the conditional posterior of the latent variables 
given the parameters 

p(y|f,0)p(f|,X,0) 



p(f\V,9,< 



(8) 



/p(y|f,0)p(f|X,0)df' 

which is discussed in the section [3j After this, we can (approximately) marginalize over the 
parameters to obtain the marginal posterior distribution for the latent variables 



p(f\V) = J p(f\V,9,cf>) P (9, 



(9) 



treated in the section [5] The posterior predictive distributions can be obtained similarly by 
first evaluating the conditional posterior predictive distribution, for example p(f\T>, 9, (j), x), 
and then marginalizing over the parameters. The joint predictive distribution p(y\T>, 9, <f>, x) 
would require integration over possibly high dimensional distribution p(f\T>, 9, (j), x). How- 
ever, usually we are interested only on the marginal predictive distribution for each jji 
separately which requires only one dimensional integrals 



p(yi\V,Xi,9,< 



PiVilfh <P)p(fi\^, Xj, 9, 4>)dfi. 



(10) 



If the parameters are considered fixed, GP's marginalization and conditionalization 
properties can be exploited in the prediction. Given the conditional posterior distribu- 
tion p(f \T>, 9, (ft), which in general is not Gaussian, we can evaluate the posterior predictive 
mean from the conditional mean Ej| f e ^[/(x)] = /c(x, X) f (see equation ([7| and the 

text below it). Since this holds for any f, we obtain a parametric posterior mean function 



mp(x|X>,0,0) = / E f | W [/(x)b(f \V,9,<ft)df = k(x,X\9)K- i ] i Ei lv> e )4> [f}. 



11 



The posterior predictive covariance between any set of latent variables, f is 



Cov 



f\T>,9,4> 



[f ] = E. 



f \T>,9,4> 



Cov f , f [f] +Cov { \ VA(t> E f|f [f] 



12) 



where the first term simplifies to the conditional covariance in equation ([7]) and the second 
term to A;(x, X) K^ f Covf \x>,e,</> [f] f &(X, x'). The posterior covariance function is then 

fc p (x,x'|P,M) = fc(x,x'|0) - fc(x,X|0) (K£-K£ Cov f | W [f]K- f | f ) fc(X,x'|0). (13) 

From now on the posterior predictive mean and covariance will be denoted m p (x) and 
fep(x, x ). 

Even if the exact posterior p(f\T>,9,cft) is not available in closed form, we can still ap- 
proximate its posterior mean and covariance functions if we can approximate E f \x>,d,<f> an( l 
Covf \D,e,4> [f]- A common practice to approximate the posterior p(f \T>, 9, (ft) is either with 
Markov chain Monte Carlo (MCMC) (e.g. ~ 



Rasmussen 



2005 



(e.g. Williams and Barber 



1998 



1997 


1998, 


Diggle et al. 1998, 


Kuss and 



Christensen et al. 2006) or by giving an analytic approximation to it 



Gibbs and Mackay 2000, Minka, 2001 Csato and Opper 



2002 Rue et al. 2009). The analytic approximations considered here assume a Gaussian 



form in which case it is natural to approximate the predictive distribution with a Gaussian 



as well. In this case, the equations (11) and (13) give its mean and covariance. Detailed 



considerations on the approximation error and the asymptotic properties of the Gaussian 



approximation are presented, for example, by Rue et al. (2009) and Vanhatalo et al. (2010). 



3. Conditional posterior and predictive distributions 

3.1 Gaussian observation model: the analytically tractable case 

With a Gaussian observation model, yj ~ N(/j,cr 2 ), where a 2 is the noise variance, the 
conditional posterior of the latent variables can be evaluated analytically. Marginalization 
over f gives the marginal likelihood 

p(y |X, 9, a 2 ) = N(y |0, K f>f Wl). (14) 

Setting this in the denominator of the equation (|8]), gives a Gaussian distribution also for 
the conditional posterior of the latent variables 

f \V, 9, a 2 ~ N(K f ,f (Kf, f Wl)' 1 y, K f>f - K fjf (K fjf +CJ 2 !)- 1 K f)f ). (15) 



Since the conditional posterior of f is Gaussian, the posterior process, or distribution 



p(f\T>), is also Gaussian. By placing the mean and covariance from (15) in the equations 



lib and (13) we obtain the predictive distribution 



f\V,ey~g<p (m p (x),A; p (x,x / )) 



(16) 



where the mean is m p (x) = fc(x, X)(Kf j f +cr 2 I)~ 1 y and covariance is fe p (x, x') = k(x, x') — 
k(5t, X)(Kf f +<t 2 I) -1 /c(X, x'). The predictive distribution for new observations y can be 
obtained by integrating p(y\V,6,a 2 ) = f p(y\f , a 2 )p({ \T>, 9, a 2 )di . The result is, again, 
Gaussian with mean Ej^^f] and covariance Cov^-p^f] + a 2 I. 

3.2 Laplace approximation 

With a non-Gaussian likelihood the conditional posterior needs to be approximated. The 
Laplace approximation is constructed from the second order Taylor expansion of logp(f | y, 0, <; 
around the mode f , which gives a Gaussian approximation to the conditional posterior 



p(f\v,e 

arg maxf p(f \T>, 6, 



«g(f|P,M) = N(f|f,E), (17) 
and is the Hessian of the negative log conditional 



where f 

posterior at the mode (iGelman et al.l 20041 Rasmussen and Williams 2006): 



-VV logp(f \v,e, 



lf=f 



K- f i f +W, 



where W is a diagonal matrix with entries Wjj = V^V^ logp(y\fi,< 



approximation scheme Laplace method following Williams and Barber ( 1998 ) , but essentially 



the same approximation is named Gaussian approximation by Rue et al. (2009) 



If, 



Si 



(18) 



We call the 



Setting Efpe[f] = f and Covfp^ff] = (K^ +W) 1 into (11) and (13) respectively, 

gives after rearrangements and using Kf f f = V logp(y | f)| f= j, the approximate posterior 
predictive distribution 

f\V,9,a 2 ~ QV (m p (x), fc p (x,x')) . (19) 

Here the mean and covariance are m p (x) = k(x, X)V logp(y | f )| f= f and k p (x, x') = k(x, x')— 
k(5t, X)(Kf f +W) _1 fc(X, x'). The approximate conditional predictive density of %ji can now 
be evaluated, for example, with quadrature integration over each fi separately 



pimp, Q, > 



p(m\fi, 4>)<i(fip, o, 4>)dfi. 



(20) 



3.3 Expectation propagation algorithm 

The Laplace method constructs a Gaussian approximation at the posterior mode and ap- 
proximates the posterior covariance via the curvature of the log density at that point. The 



expectation propagation (EP) algorithm (Minka, 2001), for its part, tries to minimize the 
Kullback-Leibler divergence from the true posterior to its approximation 



1 n 

?(f \v,e,(j>) = ^— p(f \o) [J u(f& iU, 



(21) 



where the likelihood terms have been replaced by site functions ti(fi\Zi, fy, of) = Z{ N(/j|/2j, of) 
and the normalizing constant by Zep- Detailed description of the algorithm is provided, for 
example by Rasmussen and Williams (2006) and Jylanki et al. (2011). EP's conditional 



posterior approximation is 



?(f |2>,M)=N(f |K fif (K f)f +S) 



•Kf.fCKf^+S)- 1 ^), 



(22) 



where £ = diag [of, cr„] and fx = [fix, 



The predictive mean and covariance are 



again obtained from equations (11) and (13) analogically to the Laplace approximation. 



From equations (15), (17), and (22) it can be seen that the Laplace and EP approxima- 



tions are similar to the exact solution with the Gaussian likelihood. The diagonal matrices 
W _1 and X correspond to the noise variance <r 2 I and, thus, these approximations can be 



interpreted as Gaussian approximations to the likelihood (Nickisch and Rasmussen, 2008). 



3.4 Markov chain Monte Carlo 

The accuracy of the approximations considered so far is limited by the Gaussian form of 
the approximating function. An approach, which gives exact solution in the limit of an 
infinite computational time, is the Monte Carlo integration (Robert and Casella 2004). 
This is based on sampling from p(f \T>, 9, <fi) and using the samples to represent the posterior 
distribution. 



In MCMC methods (Gilks et al. 1996), one constructs a Markov chain whose station- 



ary distribution is the posterior distribution and uses the Markov chain samples to obtain 
Monte Carlo estimates. GPstuff provides, for example, a scaled Metropolis Hastings algo- 



rithm (Neal 1998) and Hamiltonian Monte Carlo (HMC) (Duane et al. 1987 Neal 1996) 



with variable transformation discussed in (Christensen et al. 



2006 



Vanhatalo and Vehtari 



2007) to sample from p(f \T>, 9, 4>). The approximations to the conditional posterior of f are 



we can sample from the poste- 

s(0 



illustrated in Figure [T] 

After having the posterior sample of latent variables, 

rior predictive distribution of any set of f simply by sampling with each f ^ one f W from 
p(f|fW,0,<£), which is given in the equation (|7|). Similarly, we can obtain a sample of y by 

drawing one for each f from p(y\t, 9, </>). 



4. Marginal likelihood given parameters 

The marginal likelihood given the parameters, p(T>\9,(j)) = j p(y\ f , 0)p(f |X, 8)df, is an 
important quantity when inferring the parameters as discussed in the next section. With a 



Gaussian likelihood it has an analytic solution ( 14 ) which gives the log marginal likelihood 



\ogp(V\9, o) = ~\ log(27r) - i log | K f)f +d 2 I| - X - y T (K f , f +a 2 I) 



y • 



(23) 



If the likelihood is not Gaussian, the marginal likelihood needs to be approximated. The 
Laplace approximation to the marginal likelihood is constructed, for example, by making a 
second order Taylor expansion for the integrand p(y| F, (j))p(f |X, 8) around f. This gives a 
Gaussian integral over f multiplied by a constant, and results in the log marginal likelihood 




-0.8 -0.55 -0.3 

(a) Disease mapping 
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Figure 1: Illustration of the Laplace approximation (solid line), EP (dashed line) 
and MCMC (histogram) for the conditional posterior of a latent variable p(fi\T>, 0) 
in two applications. On the left, a disease mapping problem with Poisson likelihood 



(used in Vanhatalo et al. , 2010) where the Gaussian approximation works well. On 



the right, a classification problem with probit likelihood (used in Vanhatalo and 



Vehtari 2010) where the posterior is skewed and the Gaussian approximation is 
not so good. 



approximation 



logp(V\9, 0) « log q(p\9, </>) oc - l -f Kfi f f + logp(y|f , 



1 



log |B| 



(24) 



where |B| = \I + W 1 / 2 Kff W 1 / 2 ). See also (Tierney and Kadane 



1986 



Rue et al. 



2009 



Vanhatalo et al. 2009 ) for more discussion. 



EP's marginal likelihood approximation is the normalization constant 



Zep = I p(f |X, 6) 11 Zi N(/ f Srt)dfi 
i=i 



(25) 



in equation (21). This is a Gaussian integral multiplied by a constant Y\a=i giving 



logp(P|0,0)«logZ EP = -^log|^ + S|-^ T (^ + s) 'a + Cep, (26) 



where Cep collects the terms that are not explicit functions of 8 or <p (there is an implicit 
dependence through the iterative algorithm, though). For more discussion on EP's marginal 
likelihood approximation see ( Seeger 2005 Nickisch and Rasmussen 2008 ) . 



5. Marginalization over parameters 

The previous section treated methods to evaluate exactly (the Gaussian case) or approxi- 
mately (Laplace and EP approximations) the log marginal likelihood given parameters. Now, 
we describe approaches for estimating parameters or integrating numerically over them. 



5.1 Maximum a posterior estimate of parameters 

In a full Bayesian approach we should integrate over all unknowns. Given we have integrated 
over the latent variables, it often happens that the posterior of the parameters is peaked 
or predictions are unsensitive to small changes in parameter values. In such case, we can 
approximate the integral over p(6, <f>\T>) with the maximum a posterior (MAP) estimate 



{9, <f\ = argmaxp(#, (p\T>) = argmin [— log p(T>\6, <j>) — logp(6>, </>)] . 



(27) 



In this approximation, the parameter values are given a point mass one at the posterior 
mode, and the marginal of the latent function is approximated as p(f \T>) ~ p(f \T>, 6, <j>). 
The log marginal likelihood, and its approximations, are differentiable with respect to the 



parameters (Seeger 2005 Rasmussen and Williams 2006). Thus, also the log posterior is 



differentiable, which allows gradient based optimization. The advantage of MAP estimate is 
that it is relatively easy and fast to evaluate. According to our experience good optimization 
algorithms need usually at maximum tens of optimization steps to find the mode. However, 
it underestimates the uncertainty in parameters. 

5.2 Grid integration 

Grid integration is based on weighted sum of points evaluated on grid 



M 



P (f \v) «^ P (f 



(28) 



i=l 



Here i? = [# 1 ,(/> T ] T and Aj denotes the area weight appointed to an evaluation point t?j. 



The implementation follows INLA (Rue et al. 2009) and is discussed in detail by Vanhatalo 



et al. (2010). The basic idea is to first locate the posterior mode and then to explore the log 



posterior surface so that the bulk of the posterior mass is included in the integration (see 



Figure 2(a)). The grid search is feasible only for a small number of parameters since the 



number of grid points grows exponentially with the dimension of the parameter space d. 



5.3 Monte Carlo integration 

Monte Carlo integration scales better than the grid integration in large parameter spaces 



since its error decreases with a rate that is independent of the dimension (Robert and 



Casella 2004). There are two options to find a Monte Carlo estimate for the marginal 



posterior p(f\T>). The first option is to sample only the parameters from their marginal 
posterior p(i?|X>) or from its approximation (see Figure 2(b)[ ). In this case, the posterior 
marginal of the latent variable is approximated with mixture distribution as in the grid 
integration. The alternative is to sample both the parameters and the latent variables. 

The full MCMC is performed by alternate sampling from the conditional posteriors 
p(f \T>, $) and p(i?|X>, f). Possible choices to sample from the conditional posterior of the 



parameters are, e.g., HMC and slice sampling (SLS) (Neal 2003). Sampling both the param- 



eters and latent variables is usually slow since due to the strong correlation between them 
(Vanhatalo and Vehtari 2007 Vanhatalo et al. 2010). Sampling from the (approximate) 



marginal, p($|X>), is an easier task since the parameter space is smaller. 



The parameters can be sampled from their marginal posterior (or its approximation) 
with HMC, SLS (Neal 2003) or via importance sampling (Geweke, |1989 ), In importance 
sampling, we use a Gaussian or Student-t proposal distribution (7(1?) with mean 1? and 
covariance approximated with the negative Hessian of the log posterior, and approximate 
the integral with 



M 



p(f\T>) 



2>(f|Z>,0i 



(29) 



where W{ = q('&^)/g('&^) are the importance weights. In some situations the naive Gaussian 
or Student-t proposal distribution is not adequate since the posterior distribution <jr(i?|23) may 
be non-symmetric or the covariance estimate is poor. An alternative for these situations is 



the scaled Student-t proposal distribution (Geweke, 1989) which is adjusted along each main 



direction of the approximate covariance. The implementation of the importance sampling 



is discussed in detail by Vanhatalo et al. (2010) 



The problem with MCMC is that we are not able to draw independent samples from 
the posterior. Even with a careful tuning of Markov chain samplers the autocorrelation is 
usually so large that the required sample size is in thousands, which is a clear disadvantage 
compared with, for example, the MAP estimate. 



5.4 Central composite design 



Rue et al. (2009) suggest a central composite design (CCD) for choosing the representative 
points from the posterior of the parameters when the dimensionality of the parameters, d, 
is moderate or high. In this setting, the integration is considered as a quadratic design 
problem in a d dimensional space with the aim at finding points that allow to estimate the 
curvature of the posterior distribution around the mode. The design used in GPstuff is the 



fractional factorial design (Sanchez and Sanchez 2005) augmented with a center point and 



a group of 2d star points. The design points are all on the surface of a d-dimensional sphere 



and the star points consist of 2d points along each axis, which is illustrated in Figure [2(c) 



The integration is then a finite sum (28) with special weights (Vanhatalo et al. 2010) 



CCD integration speeds up the computations considerably compared to the grid search 
or Monte Carlo integration since the number of the design points grows very moderately. 
The accuracy of the CCD is between the MAP estimate and the full integration with the grid 



search or Monte Carlo. Rue et al. (2009) report good results with this integration scheme, 



and it has worked well in moderate dimensions in our experiments as well. Since CCD is 
based on the assumption that the posterior of the parameter is (close to) Gaussian, the 
densities at the points on the circumference should be monitored in order to detect 

serious discrepancies from this assumption. These densities are identical if the posterior is 
Gaussian and we have located the mode correctly, and thereby great variability on their 
values indicates that CCD has failed. The posterior of the parameters may be far from a 
Gaussian distribution but for a suitable transformation, which is made automatically in the 
toolbox, the approximation may work well. 




Figure 2: The grid based, Monte Carlo and central composite design integration. 
Contours show the posterior density g(log($)|P) and the integration points are 
marked with dots. The left figure shows also the vectors z along which the points 
are searched in the grid integration and central composite design. The integration 
is conducted over g(log("i?)|P) rather than q($\T>) since the former is closer to 



Gaussian. Reproduced from (Vanhatalo et al. 20101 



6. Getting started with GPstuff: regression and classification 
6.1 Gaussian process regression 

The demonstration program demo_regressionl considers a simple regression problem = 
/(xj) + £j, where £j ~ N(0,a 2 ). We will show how to construct the model with a squared 
exponential covariance function and how to conduct the inference. 



6.1.1 Constructing the model 

The model construction requires three steps: 1) Create structures that define likelihood and 
covariance function, 2) define priors for the parameters, and 3) create a GP structure where 
all the above are stored. These steps are done as follows: 

lik = lik_gaussian( 'sigma2' , 0.2~2); 

gpcf = gpcf _sexp( 'lengthScale' , [1.1 1.2], 'magnSigma2 ' , 0.2~2) 



pn=prior_logunif () ; 

lik = lik_gaussian(lik, 'sigma2_prior' , pn) ; 



pi = prior_unif () ; 

pm = prior_sqrtunif () ; 

gpcf = gpcf _sexp (gpcf , 'lengthScale_prior ' , pi, 'magnSigma2_prior ' , pm) ; 
gp = gp_set('lik' , lik, 'cf, gpcf); 

Here lik_gaussian initializes Gaussian likelihood function and its parameter values and 
gpcf _sexp initializes the squared exponential covariance function and its parameter values. 
lik_gaussian returns structure lik and gpcf_sexp returns gpcf that contain all the in- 
formation needed in the evaluations (function handles, parameter values etc.). The next 



five lines create the prior structures for the parameters of the observation model and the 
covariance function, which are set in the likelihood and covariance function structures. The 
last line creates the GP structure by giving it the likelihood and covariance function. 

Using the constructed GP structure, we can evaluate basic summaries such as covari- 
ance matrices, make predictions with the present parameter values etc. For example, the 
covariance matrices Kf ; f and C = Kf ; f +a 2 l for three two-dimensional input vectors are: 

example_x =[-1-1; 00; 11]; 
[K, C] = gp_trcov(gp, example_x) 
K = 



C = 
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6.1.2 MAP ESTIMATE FOR THE PARAMETERS 

gp_optim works as a wrapper for usual gradient based optimization functions. It is used as 
follows: 

opt=optimset('TolFun' ,le-3, 'TolX' ,le-3, 'Display' , 'iter') ; 
gp=gp_optim(gp,x,y, 'opt' , opt) ; 

gp_optim takes a GP structure, training input x, training target y (which are defined in 
demo_regressionl) and options, and returns a GP structure with parameter values opti- 
mized to their MAP estimate. By default gp_optim uses fminscg function, but gp_optim can 
use also, for example, fminlbfgs or fminunc. Optimization options are set with optimset 
function. It is also possible to set optimisation options as 

opt=struct('TolFun' ,le-3, 'TolX' ,le-3, 'Display' , 'iter') ; 

which is useful when using an optimiser not supported by optimset. All the estimated 
parameter values can be easily checked using the function gp_pak, which packs all the 
parameter values from all the covariance function structures in a vector, usually using log- 
transformation (other transformations are also possible). The second output argument of 
gp_pak lists the labels for the parameters: 

[w,s] = gp_pak(gp); 
disp(s), disp(exp(w)) 

' log(sexp .magnSigma2) ' 

' log(sexp . lengthScale x 2)' 

'log(gaussian.sigma2) ' 



2.5981 0.8331 0.7878 0.0427 



predicted surface 



* data point 




-2 -2 0.5 1 -1 -0.5 



(a) The predictive mean and training data. (b) The marginal posterior predictive 

distributions p(fi\T>). 

Figure 3: The predictive mean surface, training data, and the marginal posterior for 
two latent variables in demo_regressionl. Histograms show the MCMC solution 
and the grid integration solution is drawn with a line. 



It is also possible to set the parameter vector of the model to desired values using gp_unpak. 
gp_pak and gp_unpak are used internally to allow use of generic optimisation and sampling 
functions, which take the parameter vector as an input argument. 

Predictions for new locations x, given the training data (x, y), are done by gp_pred 
function, which returns the posterior predictive mean and variance for each /(x) (see equa- 
tion (|16|)). This is illustrated below where we create a regular grid where the posterior mean 
and variance are computed. The posterior mean m p (x) and the training data points are 
shown in Figure [3] 

[xt 1 , xt2] =meshgrid ( -1.8:0.1:1.8,-1.8:0.1:1.8); 
xt=[xtl(:) xt2(:)] ; 

[Eft_map, Varft_map] = gp_pred(gp, x, y, xt) ; 



6.1.3 Marginalization over parameters with grid integration 

To integrate over the parameters we can use any method described in the section [5] The 
grid integration is performed with the following line: 

[gp_array, P_TH, th, Eft_ia, Varft_ia, fx_ia, x_ia] = . . . 

gp_ia(gp, x, y, xt, ' int_method' , 'grid'); 

gp_ia returns an array of GPs (gp_array) for parameter values th with weights 

P_TH ([p( , &i\T))Ai]f / £ 1 ). Since we use the grid method the weights are proportional to the 



marginal posterior and Aj = IVi (see section 5.2). Ef _ia and Varf_ia contain the predictive 



mean and variance at the prediction locations. The last two output arguments can be used 
to plot the predictive distribution p(fi\T>) as demonstrated in Figure [3] x_ia(i , : ) contains 
a regular grid of values /j and fx_ia(i, : ) contains p(fi\T>) at those values. 



6.1.4 Marginalization over parameters with MCMC 



The main function for conducting Markov chain sampling is gp_mc, which loops through all 
the specified samplers in turn and saves the sampled parameters in a record structure. In 
later sections, we will discuss models where also latent variables are sampled, but now we 
concentrate on the covariance function parameters, which are sampled as follows: 

[gp_rec,g,opt] = gp_mc(gp, x, y, 'nsamples', 220); 
gp_rec = thin(gp_rec, 21, 2); 

[Eft_s, Varft_s] = gpmc_preds (rfull , x, y, xt) ; 
[Eft_mc, Varft_mc] = gp_pred(gp_rec , x, y, xt) ; 

The gp_mc function generates nsamples (here 220) Markov chain samples. At each iter- 
ation gp_mc runs the actual samplers. The function thin removes the burn-in from the 
sample chain (here 21) and thins the chain more (here by 2). This way we can decrease the 
autocorrelation between the remaining samples. GPstuff provides also diagnostic tools for 



Markov chains. See, for example, (Gelman et al. , 2004, Robert and Casella 2004) for discus- 



sion on convergence and other Markov chain diagnostics. The function gpmc_preds returns 
the conditional predictive mean and variance for each sampled parameter value. These are 
E p (f[x25 0l»)[f]) s = 1,—,M and Var^fix^^Jf], s = 1,...,M, where M is the number of 
samples. Marginal predictive mean and variance are computed directly with gp_pred. 



6.2 Gaussian process classification 

We will now consider a binary GP classification (see demo_classif ic) with observations, 
yi S { — 1, l},i = 1, ...,n, associated with inputs X = {x}™ =1 . The observations are consid- 
ered to be drawn from a Bernoulli distribution with a success probability p{jji = l|xj). The 
probability is related to the latent function via a sigmoid function that transforms it to a 
unit interval. GPstuff provides a probit and logit transformation, which give 

1 N(z\0,l)dz (30) 

-oo 

Plogit(yt|/(Xi)) = l + extf-w/C*))- (31) 

Since the likelihood is not Gaussian we need to use approximate inference methods, discussed 
in the section [3l 

6.2.1 Constructing the model 

The model construction for the classification follows closely the steps presented in the pre- 
vious section. The model is constructed as follows: 

lik = lik_probit () ; 

gpcf = gpcf_sexp('lengthScale' , [0.9 0.9], 'magnSigma2 ' , 10); 
gp = gp_set('lik' , lik, 'cf, gpcf, ' jitterSigma2' , le-9) ; 

The above lines first initialize the likelihood function, the covariance function and the GP 
structure. Since we do not specify parameter priors, the default priors are used. The model 



construction and the inference with logit likelihood (lik_logit) would be similar with probit 
likelihood. A small jitter value is added to the diagonal of the training covariance to make 
certain matrix operations more stable (this is not usually necessary but is shown here for 
illustration). 

6.2.2 Inference with Laplace approximation 

The MAP estimate for the parameters can be found using gp_optim as 

gp = gp_set(gp, ' latent_method' , 'Laplace'); 
gp = gp_optim(gp,x,y, 'opt' ,opt) ; 

The first line defines which inference method is used for the latent variables. It initializes the 
Laplace algorithm and sets needed fields in the GP structure. The default method for latent 
variables is Laplace, so this line could be omitted. gp_optim uses the default optimization 
function, f minscg, with the same options as above in the regression example. 

gp_pred provides the mean and variance for the latent variables (first two outputs), the 
log predictive probability for a test observation (third output), and mean and variance for 
the observations (fourth and fifth output) at test locations xt. 

[Eft_la, Varft_la, lpyt_la, Eyt_la, Varyt_la] = . . . 

gp_pred(gp, x, y, xt, 'yt', ones (size (xt , 1) , 1) ); 



The first four input arguments are the same as in the section 6.1 The fifth and sixth argu- 
ments are a parameter-value pair where yt tells that we give test observations ones (size (xt , 1) , 
related to xt as an optional input, in which case gp_pred evaluates their marginal posterior 
log predictive probabilities lpyt_la. Here we evaluate the probability to observe class 1 and 
thus we give a vector of ones as test observations. 



6.2.3 Inference with expectation propagation 

EP works as the Laplace approximation. We only need to set the latent method to 'EP' 
but otherwise the commands are the same as above: 

gp = gp_set(gp, ' latent_method' , 'EP'); 
gp = gp_optim(gp,x,y, 'opt' ,opt) ; 

[Eft_ep, Varft_ep, lpyt_ep, Eyt_ep, Varyt_ep] = . . . 

gp_pred(gp, x, y, xt, 'yt', ones (size(xt , 1) , 1) ); 

6.2.4 Inference with MCMC 

With MCMC we sample both the latent variables and the parameters: 

gp = gp_set(gp, 'latent_method' , 'MCMC, ' jitterSigma2' , le-6) ; 
[gp_rec,g,opt]=gp_mc(gp, x, y, 'nsamples', 220); 
gp_rec=thin(gp_rec,21,2) ; 

[Ef_mc, Varf_mc, lpy_mc, Ey_mc, Vary_mc] = . . . 

gp_pred(gp_rec, x, y, xt, 'yt', ones (size(xt , 1) , 1) ); 



For MCMC we need to add a larger jitter on the diagonal to prevent numerical problems. 
By default sampling from the latent value distribution p(f \9, T>) is done with the ellipti- 



cal slice sampling (Murray et al. 2010). Other samplers provided by the GPstuff are a 



scaled Metropolis Hastings algorithm (Neal 1998) and a scaled HMC scaled_hmc (Vanhat- 



alo and Vehtari, 2007). From the user point of view, the actual sampling and prediction 
are performed similarly as with the Gaussian likelihood. The gp_mc function handles the 
sampling so that it first samples the latent variables from p(f \6,T>) after which it samples 
the parameters from p{6\ i,T>). This is repeated until nsamples samples are drawn. 

In classification model MCMC is the most accurate inference method, then comes EP 
and Laplace approximation is the worst. However, the inference times line up in the opposite 
order. The difference between the approximations is not always this large. For example, 
with Poisson likelihood, discussed in the section |7.2| Laplace and EP approximations work, 
in our experience, practically as well as MCMC. 



6.2.5 Marginal posterior corrections 

GPstuff also provides methods for marginal posterior corrections using either Laplace or EP 



as a latent method (Cseke and Heskes, 2011). Provided correction methods are either cm2 



or fact for Laplace and fact for EP. 

Univariate marginal posterior corrections can be computed with function gp_predcm 
which returns the corrected marginal posterior distributtion for the given indices of in- 
put/outputs: 

[pc, fvec, p] = gp_predcm(gp,x,y, 'ind' , 1, 'fcorr', 'fact'); 

The returned value pc corresponds to corrected marginal posterior for the latent value 
fl, fvec corresponds to vector of grid points where pc is evaluated and p is the orig- 
inal Gaussian posterior distribution evaluated at fvec. Marginal posterior corrections 
can also be used for predictions in gp_pred and sampling of latent values in gp_rnd (see 
demo_improvedmarginalsl and demo_improvedmarginals2). gp_rnd uses univariate marginal 
corrections with Gaussian copula to sample from the joint marginal posterior of several latent 
values. 

7. Other single latent models 

In this section, we desribe other likelihood functions, which factorize so that each factor 
depend only on single latent value. 



7.1 Robust regression with Student-t observation model 

A commonly used observation model in the GP regression is the Gaussian distribution. 
This is convenient since the marginal likelihood is analytically tractable. However, a known 
limitation with the Gaussian observation model is its non-robustness, due which outlying 
observations may significantly reduce the accuracy of the inference (see Figure [4]). A well- 
known robust observation model is the Student-t distribution ( |Q'Hagan 1979) 

jl r((!/ + i)/2) / ( yi - fi) 2 \~ (u+1)/2 , \ 




Figure 4: An example of regression with outliers from (Neal 1997). On the left 



Gaussian and on the right the Student-t likelihood. The real function is plotted 
with black line. 



where v is the degrees of freedom and at the scale parameter. The Student- 1 distribution 
can be utilized as such or it can be written via the scale mixture representation 



Vi\fi,a,Ui ~ N(fi,aUi) 



(33) 
(34) 



where each observation has its own noise variance alii that is Inv-^ 2 distributed (Neal 



1997 



Gelman et al. 2004). The degrees of freedom v corresponds to the degrees of freedom in 
the Student-i distribution and ar corresponds to at- 

In GPstuff both of the representations are implemented. The scale mixture representa- 
tion is implemented in lik_gaussiansmt and can be inferred only with MCMC (as described 



by Neal 1998). The Student-t likelihood is implemented in lik_t and can be inferred with 



Laplace and EP approximation and MCMC (as described by Vanhatalo et al. 2009 Jylanki 
et al. 2011). These are demonstrated in demo_regression_robust. 



7.2 Count data 

Count data are observed in applications. One such common application is spatial epidemi- 
ology, which concerns both describing and understanding the spatial variation in the disease 
risk in geographically referenced health data. One of the most common tasks in spatial epi- 
demiology is disease mapping, where the aim is to describe the overall disease distribution 
on a map and, for example, highlight areas of elevated or lowered mortality or morbidity risk 
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model following the general approach discussed, for example, by Best et al. (2005). The 
data are aggregated into areas with coordinates Xj. The mortality /morbidity in an area is 
modeled with a Poisson, negative binomial or binomial distribution with mean eifii, where 
e« is the standardized expected number of cases (e.g. Ahmad et al. 2000), and /ij is the 



relative risk, whose logarithm is given a GP prior. The aim is to infer the relative risk. The 



implementation of these models in GPstuff is discussed in (Vanhatalo and Vehtari 2007 



Vanhatalo et al. 2010). 
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(a) The posterior mean. 



(b) The posterior variance. 



Figure 5: The posterior predictive mean and variance of the relative risk in the 
demo_spatiall data set obtained with FIC. 



7.2.1 POISSON LIKELIHOOD 

The Poisson likelihood is implemented in likelih_poisson and it is 

n 

y | f, e ~ JJPoisson(yi| exp(/;)e;) (35) 
i=i 

where the vector y collects the numbers of deaths for each area. Here fi = exp(/) and its 
posterior predictive mean and variance solved in the demo demo_spatiall are shown in 
Figure [5j 

7.2.2 Negative binomial likelihood 

The negative binomial distribution is a robust version of the Poisson distribution similarly 



as Student-i distribution can be considered as a robustified Gaussian distribution (Gelman 



et al. 2004 ) . In GPstuff it is parametrized as 

where //j = exp(/(xj)) and r is the dispersion parameter governing the variance. The 
model is demonstrated in demo_spatial2. 

7.2.3 Binomial likelihood 

In disease mapping, the Poisson distribution is used to approximate the true binomial dis- 
tribution. This approximation works well if the background population, corresponding to 
the number of trials in the binomial distribution, is large. Sometimes this assumption is not 
adequate and we need to use the exact binomial observation model 

n | 

y\^-U v , (z Zi L v . v pr^-p^- y ^ ( 37 ) 

i=i y % ~y 1 y*j 



where pi = exp ( / (xj ) ) / ( 1 + exp ( / (xj ) ) ) is the probability of success , and the vector z denotes 
the number of trials. The binomial observation model is not limited to spatial modeling but 
is an important model for other problems as well. The observation model is demonstrated 
in demo_binomiall with a one dimensional simulated data and demo_binomial_apc demon- 
strates the model in an incidence risk estimation. 



7.2.4 Hurdle model 



Hurdle models can be used to model excess number of zeros compared to usual Poisson and 
negative binomial count models. Hurdle models assume a two-stage process, where the first 
process determines whether the count is larger than zero, and the second process determines 



the non-zero count (Mullahy 1986). These processes factorize, and thus hurdle model can 



be implemented using two independent GPs in GPstuff. lik_probit or lik_logit can be 
used for the zero process and lik_negbinztr can be used for the count part. lik_negbinztr 
provides zero truncated negative binomial model, which can be used also to approximate 
zero-truncated Poisson model by using high dispersion parameter value. Construction of a 
hurdle model is demonstrated in demo_hurdle. Gaussian process model with logit negative 



binomial hurdle model implemented using GPstuff was used in reference (Rantonen et al. 
2011) to model sick absence days due to low back symptoms. 



7.3 Log-Gaussian Cox process 

Log-Gaussian Cox-process is an inhomogeneous Poisson process model used for point data, 
with unknown intensity function A(x), modeled with log-Gaussian process so that /(x) = 



logA(x) (see Rathbun and Cressie, 1994 M0ller et al. 1998). If the data are points X = Xj 
i = 1, 2, . . . , n on a finite region V in 5? d , then the likelihood of the unknown function / is 



p(X\f) = exp j- Qf exp(/(x))dx) + f^fi^) 



(38) 



Evaluation of the likelihood would require nontrivial integration over the exponential of GP. 



M0ller et al. (1998) propose to discretise the region V and assume locally constant intensity 



in subregions. This transforms the problem to a form equivalent to having Poisson model 
for each subregion. Likelihood after the discretisation is 



K 



P( x l/) ~ II Poisson(y fc | exp(/(x fc ))), 



(39) 



k=l 



where x is the coordinate of the kth sub-region and is the number of data points in it. 



Tokdar and Ghosh (2007) proved the posterior consistency in limit when sizes of subregions 



go to zero. 

The log-Gaussian Cox process with Laplace and EP approximation is implemented in 
the function lgcp for one or two dimensional input data. The usage of the function is 
demonstrated in demo_lgcp. This demo analyzes two data sets. The first one is one dimen- 
sional case data with coal mine disasters (from R distribution). The data contain the dates 
of 191 coal mine explosions that killed ten or more men in Britain between 15 March 1851 
and 22 March 1962. The analysis is conducted using expectation propagation and CCD 



(a) Coal mine disasters. 



(b) Redwood data. 



Figure 6: Two intensity surfaces estimated with log-Gaussian Cox process. The 
figures are from the demo_lgcp, where the aim is to study an underlying intensity 
surface of a point process. On the left a temporal and on the right a spatial point 
process. 



integration over the parameters and the results are shown in Figure |6j The second data are 
the redwood data (from R distribution). This data contain 195 locations of redwood trees 
in two dimensional lattice. The smoothed intensity surface is shown in Figure [6] 

7.4 Accelerated failure time survival models 

The accelerated failure time survival models are demonstrated in demo_survival_af t. 
7.4.1 Weibull 

The Weibull distribution is a widely used parametric baseline hazard function in survival 



analysis (Ibrahim et al. 2001). The hazard rate for the observation i is 

h i {y) = h (y)exv{f{x i )), (40) 

where y > and the baseline hazard ho(y) is assumed to follow the Weibull distribution 
parametrized in GPstuff as 

h (y)=ry r -\ (41) 



where r > is the shape parameter. This follows the parametrization used in Martino et al. 



( 20lT| ). The likelihood defined as 

n 

L = Y[ rl ~ Zl ex P ((* - + - z ^ r - lo §^) " exp(/(x,M) , (42) 



i=l 



where z is a vector of censoring indicators with Zi = for uncensored event and Zi = 1 for 
right censored event for observation i. Here we present only the likelihood function because 
we don't have observation model for the censoring. 



7.4.2 Log-Gaussian 

With Log-Gaussian survival model the logarithms of survival times are assumed to be nor- 
mally distributed. 

The likelihood is defined as 



L = 



i=i 



-(l-Zi)/2 l—Zi 

\ %)l y g X p 



log(yi) - /( x * 



a 



1 

2^2 



(i- 2i )(log(yi)-/(xi)) ; 



(43) 



where a is the scale parameter. 

7.4.3 Log-logistic 

The log-logistic likelihood is defined as 



L 



n 

i=l 



ry 



exp(/(xj)) 



1— 



1 + 



exp(/(xj)) 



(44) 



where r is the shape parameter and z\ the censoring indicators. 



7.5 Derivative observations in GP regression 

Incorporating derivative observations in GP regression is fairly straightforward, because a 
derivative of Gaussian process is a Gaussian process. In short, derivative observation are 
taken into account by extending covariance matrices to include derivative observations. This 
is done by forming joint covariance matrices of function values and derivatives. Following 
equations (Rasmussen and Williams 2006[ ) state how the covariances between function values 
and derivatives, and between derivatives are calculated 
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The joint covariance matrix for function values and derivatives is of the following form 
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Prediction is done as usual but with derivative observations joint covariance matrices are to 
be used instead of the normal ones. 



Using derivative observations in GPstuff requires two steps: when initializing the GP 
structure one must set option 'derivobs ' to 'on'. The second step is to form right sized 
observation vector. With input size n x m the observation vector with derivatives should 
be of size n + m ■ n. The observation vector is constructed by adding partial derivative 
observations after function value observations 



yobs 



y( x ) 

dxi 



dxm 



(45) 



Different noise level could be assumed for function values and derivative observations but 
at the moment the implementation allows only same noise for all the observations. The use 
of derivative observations is demonstrated in demo_derivativeobs. 

7.6 Quantile Regression 

Quantile regression is used for estimating quantiles of the response variable as a function of 



input variables (Boukouvalas et al. 2012 1. 



The likelihood is defines as 

p(y\f, o",t) = 



r(l-r) 



exp 



y-f 



a 



(r-I(y<f)) 



(46) 



where r is the quantile of interest and a is the standard deviation. I(y < f) is 1 if the 
condition inside brackets is true and otherwise. Because the logarithm of the likelihood 
is not twice differentiable at the mode, Laplace approximation cannot be used for inference 
with Quantile Regression. Quantile Regression with EP and MCMC is demonstrated in 
demo_qgp with toy data. 

8. Multilatent models 

The multilatent models consist of models where individual likelihood factors depend on 
multiple latent variables. In this section we shortly summarize such models in GPstuff. 

8.1 Multiclass classification 

In multiclass classification problems the target variables have more than two possible class 
labels, m E {1, . . . , c}, where c > 2 is the number of classes. In GPstuff, multi-class classifi- 
cation can be made either using the softmax likelihood (lik_softmax) 



p(yi\ f« 



exp(/f 



ELi e Mfi 



(47) 



where fj = [//, . . . , ff\ T , or the multinomial probit likelihood (lik_multinomialprobit) 



P(Vi\ f i 



E 



1 II '^-/f ./y 



(48) 



where the auxiliary variable Ui is distributed as p(u{) = J\f(ui\0, 1), and <I>(x) denotes the 
cumulative density function of the standard normal distribution. In Gaussian process liter- 
ature for multiclass classification, a common assumption is to introduce c independent prior 
processes that are associated with c classes (see, e.g., Rasmussen and Williams, 2006 1 . By 



assuming zero-mean Gaussian processes for latent functions associated with different classes, 
we obtain a zero-mean Gaussian prior 



P (t\x)=Ar(f\o,K), 



(49) 



where f = . . . , ff, . . . , f%, . . . , ff, . . . , /^] and K is a cn x cn block-diagonal covari- 
ance matrix with matrices K ,K 2 , . . . , K c (each of size n x n) on its diagonal. 

Inference for softmax can be made with MCMC or Laplace approximation. As de- 
scribed by Williams and Barber (1998) and Rasmussen and Williams (2006), using the 



Laplace approximation for the softmax likelihood with the uncorrelated prior processes, 
the posterior computations can be done efficiently in a way that scales linearly in c, which 
is also implemented in GPstuff. Multiclass classification with softmax is demonstrated in 
demo_multiclass. 

Inference for multinomial probit can be made with expectation propagation by using a 
nested EP approach that does not require numerical quadratures or sampling for estimation 



of the tilted moments and predictive probabilities, as proposed by Riihimaki et al. (2013). 
Similarly to softmax with Laplace's method, the nested EP approach leads to low-rank 
site approximations which retain all posterior couplings but results in linear computational 
scaling with respect to c. Multiclass classification with the multinomial probit likelihood is 
demonstrated in demo_multiclass_nested_ep. 



8.2 Multinomial 

The multinomial model (lik_multinomial) is an extension of the multiclass classifica- 
tion to situation where each observation consist of counts of class observations. Let = 
[yi,l, ■ ■ ■ ,yi,c], where c > 2, be a vector of counts of class observations related to inputs Xi 
so that 

Multinomial([pj 5 i, . . . ,p^ c ],m) (50) 



where rii = Y^j=i Vi,j- The propability to observe class j is 

exp(//) 



Pi,j 



ELi ex p(// 



(51) 



where f j = . . . , /?] . As in multiclass classification we can introduce c independent prior 
processes that are associated with c classes. By assuming zero-mean Gaussian processes for 
latent functions associated with different classes, we obtain a zero-mean Gaussian prior 



p(f\X)=AT(f\0,K), 



(52) 



where f = . . . , ff, . . . , . . . , /f , . . . , f^\ T and K is a cn x cn block-diagonal covari- 
ance matrix with matrices K ,K 2 , . . . , K c (each of size n x n) on its diagonal. This model is 



demonstrated in demo_multinomial and it has been used, for example, in (Juntunen et al 
2012). 



8.3 Cox proportional hazard model 

For the individual i, where i = 1, . .. ,n, we have observed survival time (possibly right 
censored) with censoring indicator 5i, where Si = if the ith observation is uncensored and 
Si = 1 if the observation is right censored. The traditional approach to analyze continuous 



time-to-event data is to assume the Cox proportional hazard model (Cox 1972). 

hiit) = h {t)exp(xfp), 



(53) 



where ho is the unspecified baseline hazard rate, Xi is the d x 1 vector of covariates for the 

~ of 



aji,. 



, x r 



ith patient and (3 is the vector of regression coefficients. The matrix X 
size n x d includes all covariate observations. 

The Cox model with linear predictor can be extended to more general form to enable, 



for example, additive and non- linear effects of covariates (Kneib 2006, Martino et al. 2011). 
We extend the proportional hazard model by 



hi(t) = exp(log(/i (i)) + r]i(xi)), 



(54) 



where the linear predictor is replaced with the latent predictor r/i depending on the covariates 
X{. By assuming a Gaussian process prior over rj = (j]x, . . . ,rj n ) T , smooth nonlinear effects 
of continuous covariates are possible, and if there are dependencies between covariates, GP 
can model these interactions implicitly. 



A piecewise log-constant baseline hazard (see, e.g. Ibrahim et al. 2001 Martino et al 



2011) is assumed by partitioning the time axis into K intervals with equal lengths: = 
sq < si < S2 < ■ ■ ■ < sk, where sk > V% for all i = 1, . . . , n. In the interval k (where 
k = 1, . . . , K), hazard is assumed to be constant: 



h (t) = Afc for tE(s fc _i,Sfc]. 
For the ith individual the hazard rate in the A;th time interval is then 

hi(t) = exp(/ fc + rji(xi)), t G (s fe _i, s k ], 



(55) 



(56) 



where = log(Afc). To assume smooth hazard rate functions, we place another Gaussian 
process prior for / = (/i, . . . , /k) T - We define a vector containing the mean locations of K 
time intervals as r = (n, . . . , tk) T ■ 

The likelihood contribution for the possibly right censored ith observation (yi,5i) is 
assumed to be 

U = hi{ yi )^~^ exp {- P hi(t)dtj . (57) 

Using the piecewise log-constant assumption for the hazard rate function, the contribution 
of the observation i for the likelihood results in 

(fc-i 
-[(Vi ~ Sfc-i)A fe + ^2(s g - s g -i)\g] exp(r]i) ) , (58) 



where m £ (sk-i,Sk] (Ibrahim et al. 2001 Martino et al. 2011). By applying the Bayes 
theorem, the prior information and likelihood contributions are combined, and the posterior 



distribution of the latent variables can be computed. Due to the form of the likelihood 
function, the resulting posterior becomes non-Gaussian and analytically exact inference is 
intractable. GPstuff supports MCMC and Laplace approximation to integrate over the latent 
variables. The use of Gaussian process Cox proportional hazard model is demonstrated in 
demo_survival_coxph. The Gaussian process Cox proportional hazard model implemented 



with GPstuff was used in reference (Joensuu et al. 2012) to model risk of gastrointestinal 
stromal tumour recurrence after surgery. 

8.4 Zero-inflated negative binomial 

Zero-inflated negative binomial model is suitable for modelling count variables with excessive 
number of zero observations compared to usual Poisson and negative binomial count models. 
Zero-inflated negative binomial models assume a two-stage process, where the first process 
determines whether the count is larger than zero, and the second process determines the 



non-zero count (Mullahy 1986). In GPstuff, these processes are assumed independent a 



priori, but they become a posteriori dependent through the likelihood function 

p+(l — p)NegBm(y\y = 0), when y = (59) 
(1 -p)NegBin(y|y > 0), when y > 0, (60) 

where the probability p is given by a binary classifier with logit likelihood. NegBin is the 
Negative-binomial distribution parametrized for the i'th observation as 

y»!r(r) \r + m) \r + mj 

where m = ejexp(/(xj)) and r is the dispersion parameter governing the variance. In 
GPstuff, the latent value vector f = [f ^ f^"] 7 " has length 2iV, where N is the number of 
observations. The latents fi are associated with the classification process and the latents f2 
with the negative-binomial count process. 

8.5 Density estimation and regression 

Logistic Gaussian process can be used for flexible density estimation and density regression. 
GPstuff includes implementation based on Laplace (and MCMC) approximation as described 



in (Riihimaki and Vehtari 2012). 



The likelihood is defined as 

p(x|f)= n E » ri exp ( /,)- (62) 

For the latent function /, we assume the model /(x) = g(x) + h(x) T /3, where the GP prior 
g(x) is combined with the explicit basis functions h(x). Regression coefficients are denoted 
with (3, and by placing a Gaussian prior j3 ~ A/"(b, B) with mean b and covariance B, the 
parameters (3 can be integrated out from the model, which results in the following GP prior 
for /: 

/(x) ~ GV (h(x) T b, k(x, x) + h(x) T 5 h(x')) . (63) 
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(a) Galaxy data. (b) Old faithful data. 

Figure 7: Density estimates for two different distributions. Figures are from 
demo_lgpdens. On the left blue line is density estimation of Galaxy data while 
gray area is 95% mean region and on the right is 2D density estimation with coun- 
tours based on Old faithful data. 



For the explicit basis functions, we use the second-order polynomials which leads to a GP 
prior that can favour density estimates where the tails of the distribution go eventually to 
zero. 

We use finite-dimensional approximation and evaluate the integral and the Gaussian 
process in a grid. We approximate inference for logistic Gaussian process density estimation 
in a grid using Laplace's method or MCMC to integrate over the non-Gaussian posterior 



distribution of latent values (see details in Riihimaki and Vehtari 2012) 



Logistic Gaussian processes are also suitable for estimating conditional densities p(t\ x), 
where t is a response variable. We discretize both input and target space in a finite region 
to model the conditional densities with the logistic GP. To approximate the resulting non- 
Gaussian posterior distribution, we use again Laplace's method. 

ID and 2D density estimation and density regression are implemented in the function 
lgpdens. The usage of the function is demonstrated in demo_lgpdens. 

8.6 Input dependent models 

In input-dependent models more than one latent variables are used for modeling different 
parameters of the distribution. Additional latent variables in turn enable modeling depen- 
dencies between inputs and parameters. 

8.6.1 Input-dependent Noise 



Input-dependent noise model (Goldberg et al. 1997) can be used to model regression data. 



The model assumes Gaussian distributed data with latent variables defining both mean and 
variance of the Gaussian distribution instead of just the mean like in the standard Gaussian 



likelihood. Setting independent GP priors for the two latent variables, it is possible to infer 
non-constant noise, e.g. input-dependent, in the data. 
The likelihood is defined as 

n 

p(y | f« f < 2 >, a 2 ) = H N( Vi \fP,a 2 ejq>(// 2) )), (64) 
i=i 

with latent function defining the mean and /( 2 ) defining the variance. Input-dependent 
noise is demonstrated in demo_inputdependentnoise with both heteroscedastic and con- 
stant noise. 

8.6.2 Input-dependent overdispersed Weibull 

In input-dependent Weibull model additional latent variable is used for modeling shape 
parameter r of the standard Weibull model. 
The likelihood is defined as 

p(y | f «, ff \ z) = f[ exp(/f V"^ exp ((1 - Zi )f[ l) (65) 

i=l 

+ (1 - Zi)(exp(/f } ) - l)log(y,)exp(/( 1) )yf p( ^ )) ), 

where z is a vector of censoring indicators with z% = for uncensored event and z% = 1 for 
right censored event for observation i. Latent variable fi is used for modeling the shape 
parameter here. Because the dispersion parameter is constrained to be greater than 0, 
it must be computed through exp(/2) to ensure positiveness. Input-dependent Weibull is 
demonstrated in demo_inputdependentweibull. 



9. Mean functions 

In the standard GP regression a zero mean function is assumed for the prior process. This 
is convenient but there are nonetheless some advantages in using a specified mean function. 
The basic principle in doing GP regression with a mean function is to apply a zero mean 
GP for the difference between observations and the mean function. 



9.1 Explicit basis functions 



Here we follow closely the presentation of (Rasmussen and Williams 2006) about the subject 
and briefly present the main results. A mean function can be specified as a weighted sum 
of some basis functions h 

m = h(x) T /3, 



with weights (3. The target of modeling, the underlying process g, is assumed to be a sum 
of mean function and a zero mean GP. 



g = h(x)'/3 + gp(0,K). 



By assuming Gaussian prior for the weights j3 ~ A/"(b,B), the weight parameters can be 
integrated out and the prior for g is another GP 



prior g ~ gV ( h(x) T b, K + h(x) T B h(x) 



The predictive equations are obtained by using the mean and covariance of the prior (66) 
in the zero mean GP predictive equations ( 15 ) 

E(f*) + R T /3, 
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Cov(f*) + R T (B- 1 + HK^H 1 ) R, 
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(66) 
(67) 



H* - HK y K, 



, x is row vector. 



If the prior assumptions about the weights are vague then B 1 -> O, (O is a zero matrix) 



(68) 
(69) 



and the predictive equations ( |66| ) and (67) don't depend on b or B 

= E(f„) + R T ft,, 
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Corresponding the exact and vague prior for the basis functions' weights there are two 
versions of the marginal likelihood. With exact prior the marginal likelihood is 



logp(y |X,b,B) 
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where n is the amount of observations. Its derivative with respect to hyperparameters are 
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With a vague prior the marginal likelihood is 



logp(y | X) 
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where m is the rank of H T . Its derivative is 
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(70) 



where has been used the fact that matrices K^ 1 , and A^ are symmetric. The above 
expression (70) could be simplified a little further because y T PG + G T Py = 2 • y T PG. 
The use of mean functions is demonstrated in demo_regression_meanf . 



10. Sparse Gaussian processes 

The evaluation of the inverse and determinant of the covariance matrix in the log marginal 
likelihood (or its approximation) and its gradient scale as 0(n 3 ) in time. This restricts the 
implementation of GP models to moderate size data sets. However, the unfavorable scaling 
in computational time can be alleviated with sparse approximations to GP and compactly 
supported (CS) covariance functions. The sparse approximations scale as 0(nm 2 ), where 
m < n and the CS covariance functions lead to computations which scale, in general, 
as 0(n 3 ) but with a smaller constant than with traditional globally supported covariance 
functions. Many sparse GPs are proposed in the literature and some of them are build into 
GPstuff. 



10.1 Compactly supported covariance functions 



A CS covariance function gives zero correlation between data points whose distance exceeds 
a certain threshold. This leads to a sparse covariance matrix. The challenge with construct- 
ing CS covariance functions is to guarantee their positive definiteness and much literature 



has been devoted on the subject (see e 


g. Sanso and Schuh 


1987 


Wu 


1995 


Wendland 


1995 


Gaspari and Cohn 1999 


Gneiting 


1999, 


2002 


Buhmann 


2001 


The CS functions im- 


plemented in GPstuff are Wendland's piecewise polynomials k PPtq (Wendland 


2005 


), such 



as 



W = "f (! - r )+ +2 (d 2 + ^ + ^ + (3j + 6)r + 3) , (71) 



where j = [d/2\ +3. These functions correspond to processes that are q times mean square 
differentiable and are positive definite up to an input dimension d. Thus, the degree of 
the polynomial has to be increased alongside the input dimension. The dependence of CS 
covariance functions to the input dimension is very fundamental. There are no radial CS 
functions that are positive definite on every K d 



Wendland 



1995 



theorem 9.2). 



(see e.g. 

The key idea with CS covariance functions is that, roughly speaking, only the nonzero 
elements of the covariance matrix are used in the calculations. This may speed up the 
calculations substantially since in some situations only a fraction of the elements of the 
covariance matrix are non-zero 



(see e.g. 



Vanhatalo and Vehtari, 2008 Rue et al. 2009). In 



practice, efficient sparse matrix routines are needed (Davis 2006). These are nowadays a 
standard utility in many statistical computing packages or available as an additional package 
for them. GPstuff utilizes the sparse matrix routines from SuiteSparse written by Tim Davis 



(http://www.cise.ufl.edu/research/sparse/SuiteSparse/) and this package should be 



installed before using CS covariance functions. 

The CS covariance functions have been rather widely studied in the geostatistics applica- 



Gneiting 


2002, 


Furrer et al. 


2006 


Moreaux 



speed-up relies on efficient linear solvers for the predictive mean [f] = K| f (K^f +a ) y. 
The parameters are either fitted to the empirical covariance, optimized using a line search 



in one dimension (Kaufman et al. 2008) or sampled with Metropolis Hastings. The benefits 
from a sparse covariance matrix have been immediate since the problems collapse to solving 
sparse linear systems. However, utilizing the gradient of the log posterior of the parameters, 
as is done in GPstuff needs extra sparse matrix tools. These are introduced and discussed 
by Vanhatalo and Vehtari (2008). EP algorithm requires also special considerations with 



CS covariance functions. The posterior covariance in EP (22) does not remain sparse, and 
thereby it has to be expressed implicitly. This issue is discussed by | Vanhatalo and Vehtari 
(20101) andlVanhatalo et all (|2010|). 



10.1.1 Compactly supported covariance functions in GPstuff 

The demo demo_regression_ppcs contains a regression example with CS covariance func- 
tion k pp< 2 (gpcf _ppcs2). The data contain US annual precipitation summaries from year 
1995 for 5776 observation stations. The user interface of GPstuff makes no difference be- 
tween globally and compactly supported covariance functions but the code is optimized to 
use sparse matrix routines whenever the covariance matrix is sparse. Thus, we can construct 
the model, find the MAP estimate for the parameters and predict to new input locations in 
a familiar way: 

pn = prior_t( 'nu' , 4, 's2', 0.3); 

lik = lik_gaussian( 'sigma2' , 1, ' sigma2_prior ' , pn) ; 
pl2 = prior_gamma( ' sh' , 5, 'is', 1); 
pm2 = prior_sqrtt ( 'nu' , 1, 's2', 150); 

gpcf 2 = gpcf _ppcs2( 'nin' , nin, 'lengthScale' , [1 2], 'magnSigma2 ' , 3, ... 

'lengthScale_prior ' , pl2, 'magnSigma2_prior ' , pm2) ; 
gp = gp_set ( ' lik' , lik, 'cf, gpcf 2, ' j itterSigma2 ' , le-6) ; 
gp = gp_optim(gp,x,y, 'opt' ,opt) ; 
Eft = gp_pred(gp, x, y, xx) ; 
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(a) The nonzero ele- 
ments of Kf f . 



(b) The posterior predictive mean surface. 



Figure 8: The nonzero elements of Kf f with fc pP) 2 function, and the posterior 
predictive mean of the latent function in the US precipitation data set. 



With this data the covariance matrix is rather sparse since only about 5% of its elements are 
non-zero. The structure of the covariance matrix is plotted after the approximate minimum 



degree (AMD) permutation (Davis 2006) in Figured] The demo demo_spatial2 illustrates 



the use of compactly supported covariance functions with a non-Gaussian likelihood. 



10.2 FIC and PIC sparse approximations 



Snelson and Ghahramani (2006) proposed a sparse pseudo-input GP (SPGP), which Quihone 



Candela and Rasmussen (2005 



named later fully independent conditional (FIC). The par- 



tially independent conditional (PIC) sparse approximation is an extension of FIC (Quihonero- 



Candela and Rasmussen 



2005 



here following Quihonero-Candela and Rasmussen (2005). See also (Vanhatalo and Vehtari 



Snelson and Ghahramani 2007), and they are both treated 



2008 Vanhatalo et al. 2010) for further discussion. The approximations are based on in- 
troducing an additional set of latent variables u = {ui} r j ^ =1 , called inducing variables. These 
correspond to a set of input locations X u , called inducing inputs. The latent function prior 
is approximated as 



p(f |X, 9) « q(f |X, X u , 9) = J g(f |X, X u , u, 9)p(u |X„, 0)du, 



(72) 



where q({ |X, X n , u, 9) is an inducing conditional. The above decomposition leads to the 
exact prior if the true conditional f |X, X u ,u, 9 ~ N(Kf u u, Kff — Kf jU K^ 1 U K Uj f) is 
used. However, in FIC framework the latent variables are assumed to be conditionally 
independent given u, in which case the inducing conditional factorizes q(f |X, X u , u, 9) = 
Q <7j(/j|X, X u , u, 9). In PIC latent variables are set in blocks which are conditionally inde- 
pendent of each others, given u, but the latent variables within a block have a multivariate 
normal distribution with the original covariance. The approximate conditionals of FIC and 
PIC can be summarized as 



g(f |X, X u , u, 9, M) = N(f | K fjU K£ u u, mask (K f , f -K f)U K^ u K u , f |M)), 



(73) 



where the function A = mask (-|M), with matrix M of ones and zeros, returns a matrix A of 

1 and Ay = otherwise. An approximation with 



size M and elements A 



'.i 



]ij if My 



M = I corresponds to FIC and an approximation where M is block diagonal corresponds 
to PIC. The inducing variables are given a zero-mean Gaussian prior u \8, X u ~ N(0, K UjU ) 
so that the approximate prior over latent variables is 

q(f |X, X u , 9, M) = N(f |0, K f)U K^ u K Ujf + A). (74) 

The matrix Kf ]U K^ u K u> f is of rank m and A is a rank n (block) diagonal matrix. The 
prior covariance above can be seen as a non-stationary covariance function of its own where 
the inducing inputs X u and the matrix M are free parameters similar to parameters, which 



can be optimized alongside 9 (Snelson and Ghahramani 2006 Lawrence 2007). 



The computational savings are obtained by using the Woodbury-Sherman-Morrison 



lemma (e.g. Harville 1997) to invert the covariance matrix in (74) as 

(K f , u K£ u K U)f + A)- 1 = A" 1 - VV T , 



(75) 



where V = A -1 Kf >u chol[(K U)U + K Uj f A -1 Kf iU ) -1 ]. There is a similar result also for the 
determinant. With FIC the computational time is dominated by the matrix multiplications, 
which need time 0(m 2 n). With PIC the cost depends also on the sizes of the blocks in A. 
If the blocks were of equal size b x b, the time for inversion of A would be 0(n/b x 6 3 ) = 
0(nb 2 ). With blocks at most the size of the number of inducing inputs, that is b = m, the 
computational cost in PIC and FIC are similar. PIC approaches FIC in the limit of a block 



size one and the exact GP in the limit of a block size n (Snelson 2007). 



10.2.1 FIC SPARSE APPROXIMATION IN GPSTUFF 



The same data that were discussed in the section 6.1 is analyzed with sparse approximations 



in the demo demo_regression_sparsel. The sparse approximation is a property of the GP 
structure and we can construct the model similarly to the full GP models: 



gp_fic = gp_set('type', 'FIC, 'lik\ lik, 'cf>, gpcf, 'X_u' , X_u) 

The difference is that we have to define the type of the sparse approximation, here ' FIC ' , and 
set the inducing inputs X_u in the GP structure. Since the inducing inputs are considered 
as extra parameters common to all of the covariance functions (there may be more than 
one covariance function in additive models) they are set in the GP structure instead of 
the covariance function structure. If we want to optimize the inducing inputs alongside 
the parameters they need to have a prior as well. GP structure initialization gives them a 
uniform prior by default. 

We can optimize all the parameters, including the inducing inputs, as with a full GP. 



Sometimes, for example in spatial problems, it is better to fix the inducing inputs (Vanhatalo 



et al. 2010) or it may be more efficient to optimize the parameters and inducing inputs sep- 



arately, so that we iterate the separate optimization steps until convergence (demonstrated 
in demo_regression_sparsel). The parameters to be optimized are defined by the field 
inf er_params in the GP structure. This field regulates which parameters are considered 
fixed and which are inferred in the group level (covariance function, inducing inputs, like- 
lihood). We may also want to fix one of the parameters inside these groups. For example, 
covariance function magnitude. If this is the case, then the parameter to be fixed should be 
given an empty prior (prior_f ixed). If the parameter has a prior structure it is an indicator 
that we want to infer that parameter. 



10.2.2 PIC SPARSE APPROXIMATION IN GPSTUFF 



In PIC, in addition to defining the inducing inputs, we need to appoint every data point in a 
block. The block structure is common to all covariance functions, similarly to the inducing 
inputs, for which reason the block information is stored in the GP structure. After the 
blocking and initialization of the inducing inputs the GP structure is constructed as follows: 

gp_pic = gp_set ( 'type ' , 'PIC , 'lik' , lik, 'cf ' ,gpcf , 'X_u' ,X_u, 'tr_index' , tr index) ; 

Above, the cell array trindex contains the block index vectors for training data. It means 
that, for example, the inputs and outputs x(trindex{i} , : ) and y (trindex{i} , : ) belong 
to the i'th block. The optimization of parameters and inducing inputs is done the same way 
as with FIC or a full GP model. In prediction, however, we have to give one extra input, 
tst index, for gp_pred. This defines how the prediction inputs are appointed in the blocks 
in a same manner as trindex appoints the training inputs. 

Eft_pic = gp_pred(gp_pic, x, y, xt, 'tstind' , tstindex) ; 

10.3 Deterministic training conditional, subset of regressors and variational 
sparse approximation 



The deterministic training conditional is based on the works by Csato and Opper (2002) 



and Seeger et al. (2003) and was earlier called Projected Latent Variables (see Quihonero 
Candela and Rasmussen 2005 for more details). The approximation can be constructed 



similarly as FIC and PIC by defining the inducing conditional, which in the case of DTC is 



q(f |X, X u , u, 9) = N(f | K f)U K^ u u, 0). 
This implies that the approximate prior over latent variables is 

q{f |X, X u , 9) = N(f |0, K fjU K"* K u>f ). 



(76) 



(77) 



The deterministic training conditional is not strictly speaking a proper GP since it uses 
different covariance function for the latent variables appointed to the training inputs and 
for the latent variables at the prediction sites, /. The prior covariance for / is the true 
covariance instead of K U 1 U K u ?. This does not affect the predictive mean since 

the cross covariance Cov[f, f] = Kf jU K"^ K u j-, but it gives a larger predictive variance. 
An older version of DTC is the subset of regressors (SOR) sparse approximation which 
utilizes Kj K„ U K However, this resembles a singular Gaussian distribution and thus 
the predictive variance may be negative. DTC tries to fix this problem by using K? ? (see 



Quinonero-Candela and Rasmussen 2005). DTC and SOR are identical in other respects 
than in the predictive variance evaluation. In spatial statistics, SOR has been used by 



Banerjee et al. (2008) with a name Gaussian predictive process model. 



The approximate prior of the variational approximation by Titsias (2009) is exactly 



the same as that of DTC. The difference between the two approximations is that in the 
variational setting the inducing inputs and covariance function parameters are optimized 
differently. The inducing inputs and parameters can be seen as variational parameters that 



should be chosen to maximize the variational lower bound between the true GP posterior 
and the sparse approximation. This leads to optimization of modified log marginal likelihood 



V(9, X u ) = log[iV(y |0, a 2 I + Q fjf )] - ^2 tr(K f , f - K fjU K£ u K Ujf ) (78) 

with Gaussian likelihood. With non-Gaussian likelihood, the variational lower bound is 
similar but a 2 I is replaced by W _1 (Laplace approximation) or X! (EP). 



10.3.1 Variational, DTC and SOR sparse approximation in GPstuff 

The variational, DTC and SOR sparse approximations are constructed similarly to FIC. 
Only the type of the GP changes: 



gp_var = gp_set ( 'type ' , 'VAR' 
gp_dtc = gp_set('type' , 'DTC 
gp_sor = gp_set ( 'type ' , 'SOR' 



'lik' . 
'lik' . 
'lik' 



lik, 
lik, 
lik, 



'cf ' 
'cf ' 
'cf ' 



gpcf , 
gpcf , 
gpcf , 



'X_u' 
'X_u' 
'X u' 



X_u); 
X_u); 
X_u); 



The sparse GP approximations are compared in the demo demo_regression_sparse2, and, 



for example, Quihonero-Candela and Rasmussen (2005), Snelson (2007), Titsias (2009), and 



Alvarez et al. (2010) treat them in more detail. 



10.4 Sparse GP models with non-Gaussian likelihoods 

The extension of sparse GP models to non-Gaussian likelihoods is straightforward in GPstuff. 
User can define the sparse GP just as described in the previous two sections and then continue 
with the construction of likelihood exactly the same way as with a full GP. The Laplace 
approximation, EP and integration methods can be used with the same commands as with 
full GP. This is demonstrated in demo_spatiall. 



10.5 Predictive active set selection for classification 



Predictive active set selection for Gaussian processes (PASS-GP, Henao and Winther , 2012 ) 
is a sparse approximation for classification models. PASS-GP uses predictive distributions 
to estimate which data points to include in the active set, so that it will include points with 
potentially high impact to the classifier decision process while removing those that are less 
relevant. Function passgp accepts classification GP with latent method set to EP or Laplace, 
and selects the active set and optimises the hyperparameters. Its use is demonstrated in 
demo_passgp. 



11. Modifying the covariance functions 
11.1 Additive models 

In many practical situations, a GP prior with only one covariance function may be too re- 
strictive since such a construction can model effectively only one phenomenon. For example, 
the latent function may vary rather smoothly across the whole area of interest, but at the 
same time it can have fast local variations. In this case, a more reasonable model would 
be /(x) = <?(x) + /i(x), where the latent value function is a sum of two functions, slow and 



fast varying. By placing a separate GP prior for both of the functions g and h we obtain an 
additive prior 

/(x)|0 ~ gv(o,k g (*>*) + Mx,x'))- (79) 

The marginal likelihood and posterior distribution of the latent variables are as before with 
Kf f = K g g + Kh,h- However, if we are interested on only, say, phenomenon g, we can 
consider the h part of the latent function as correlated noise and evaluate the predictive 
distribution for g, which with the Gaussian likelihood would be 



g(x)\V, 9 ~ QV (USt, X)(K f f +a 2 iy l y, k t 



fx, x) 



(x,X)(K fjf +a 2 I)- 1 fc 9 (X,x / )). 



(80) 

With non-Gaussian likelihood, the Laplace and EP approximations for this are similar since 
only <t 2 I and (K^f +o" 2 I) _1 y change in the approximations. 

The multiple length-scale model can be formed also using specific covariance functions. 
For example, a rational quadratic covariance function (gpcf_rq) can be seen as a scale 



mixture of squared exponential covariance functions (Rasmussen and Williams, 2006), and 



could be useful for data that contain both local and global phenomena. However, using 
sparse approximations with the rational quadratic would prevent it from modeling local 



phenomena. The additive model (79) suits better for sparse GP formalism since it enables 



to combine FIC with CS covariance functions. 



As discussed in section 10.2 FIC can be interpreted as a realization of a special kind of 



covariance function. Vanhatalo and Vehtari (2008) proposed to add FIC with CS covariance 



function which leads to a latent variable prior 



f | X, X u , 9 ~ N(0, K f , u IT* K u , f +A), 



(81) 



referred as CS+FIC. Here, the matrix A = A +&; pPjq (X, X) is sparse with the same sparsity 
structure as in /c PP) q(X, X) and it is fast to use in computations and cheap to store. 



11.1.1 Additive models in GPstuff 

The additive models are demonstrated in demo_periodic and demo_regression_additivel. 
Their construction follows closely the steps introduced in the previous sections. Consider 
that we want to construct a GP with a covariance function that is a sum of squared expo- 
nential and piecewise polynomial fc se (x, x') + k pp ^(x, x'). In GPstuff this is done as follows: 

gpcfl = gpcf_sexp(); 

gpcf2 = gpcf _ppcs2( 'nin' , 2); 

lik = lik_gaussian( 'sigma2' , 0.1); 

gp = gp_set('lik' , lik, 'cf, {gpcfl, gpcf 2}) ; 

The difference to previous models is that we give more than one covariance function structure 
in cell array for the gp_set. The inference with additive model is conducted the same way as 
earlier. If we want to make predictions for the two components k se and /c pPi 2 independently 
we can do it by giving an extra parameter-value pair for the gp_pred as 

[Ef_sexp, Varf_sexp] = gp_pred(gp, x, y, x, 'predcf, 1); 
[Ef_ppcs2, Varf_ppcs2] = gp_pred(gp, x, y, x, 'predcf, 2); 



Additive models are constructed analogously with sparse approximations by changing 
the type of the GP model. CS+FIC is a special kind of GP structure and has its own type 
definition: 



gp = gp_set('type', 'CS+FIC, 'lik', lik, >cf>, {gpcfl, gpcf2}, 'X_u>, Xu) ; 

It is worth mentioning few things that should be noticed in relation to sparse approxi- 
mations. In general, FIC is able to model only phenomena whose length-scale is long enough 
compared to the distance between adjacent inducing inputs. PIC on the other hand is able 
to model also fast varying phenomena inside the blocks. Its drawback, however, is that the 
correlation structure is discontinuous which may result in discontinuous predictions. The 
CS+FIC model corrects these deficiencies. In FIC and PIC the inducing inputs are pa- 
rameters of every covariance function, which means that all the correlations are induced 
through the inducing inputs and the shortest length-scale the GP is able to model is defined 
by the locations of the inducing inputs. In CS+FIC, the CS covariance functions do not 
utilise inducing inputs but evaluate the covariance exactly for which reason both the long 
and short length-scale phenomena can be captured. If there are more than two covariance 
functions in CS+FIC all the globally supported functions utilize inducing inputs and all the 



CS functions are added to A. A detailed treatment on the subject is given in (Vanhatalo 



and Vehtari, 2008 2010 Vanhatalo et al. 2010) 



11.2 Additive covariance functions with selected variables 

In the demo (demo_regression_additive2), we demonstrate how covariance functions can 
be modified so that they are functions of only a subset of inputs. We model an artificial 2D 
regression data with additive covariance functions that use only either the first or second 
input variable. That is, the covariance is k\(x±, + k%(x2, ^2^2) : S 2 x S 2 4 R, where 

the covariance functions are of type ki(x±, x'i\0i) ■ Kx9i 1— > 3ft. In the example the covariance 
is a sum of a squared exponential, which is a function of the first input dimension x\, and 
a linear covariance function, which is a function of the second input dimension X2- The 
covariance functions are constructed as follows: 

gpcf_sl = gpcf_sexp('selectedVariables' , 1, 'lengthScale' ,0.5, ... 

'magnSigma2' , 0.15); 
gpcf_12 = gpcf_linear('selectedVariables' , 2); 

The smaller set of inputs is chosen with the field input variable pair ' selectedVariables ' , 1. 



11.3 Product of covariance functions 

A product of two or more covariance functions ki(x,x') ■ /^(x, x')... is a valid covariance 
function as well. Such constructions may be useful in situations where the phenomenon is 
assumed to be separable. Combining covariance functions into product form is done with a 
special covariance function gpcf _prod. For example, multiplying exponential and Matern 
covariance functions to produce separable spatio-temporal model /c(x, x') = k\ (xi,Xi) ■ 
k2{[x2, X3] T , [x' 2 , x^) where the temporal component has covariance function k\ and the 
spatial components k2 is done as follows: 



gpcfl = gpcf_exp( 'selectedVariables' , 1); 

gpcf2 = gpcf _matern32 (' selectedVariables ' , [2 3]; 

gpcf = gpcf _prod( 'cf ' , {gpcfl, gpcf2}); 

The product covariance gpcf _prod can also be used to combine categorical covariance 
gpcf _cat with other covariance functions to build hierarchical linear and non-linear models, 
as illustrated in demo_regression_hier. 



12. Model assessment and comparison 

There are various means to assess the goodness of the model and its predictive perfor- 
mance (for extensive review, see Vehtari and Ojanen, 2012). GPstuff provides four basic 
approaches: marginal likelihood, cross-validation, deviance information criterion (DIC) and 
widely applicable information criterion (WAIC). 



12.1 Marginal likelihood 



Marginal likelihood is often used for model selection (see, e.g. Kass and Raftery, 1995). It 
corresponds to ML II or with model priors to MAP II estimate in the model space, selecting 
the model with the highest marginal likelihood or highest marginal posterior probability. 
In GPstuff the marginal posterior and marginal likelihood (or its approximation in case of 
non-Gaussian likelihood) given the parameters are computed by function gp_e. 



12.2 Cross-validation 



Cross-validation (CV) is an approach to estimate the predictive performance for the future 
observations while avoiding the double use of the data. The ith observation (xj,y,) in the 
training data is left out, and then the predictive distribution for yi is computed with a model 
that is fitted to all of the observations except (xj,yj). By repeating this for every point in 
the training data, we get a collection of leave-one-out cross-validation (LOO-CV) predictive 
densities. 

For GP with given hyper parameters the LOO-CV predictions can be computed in case 



of Gaussian likelihood using an analytical solution (Sundararajan and Keerthi, 2001) and 



in case of non-Gaussian likelihood using EP approximation (Opper and Winther 2000 



Rasmussen and Williams 2006 ) or Laplace approximation using linear response approach (in 



preparation), which have been implemented in gp_loopred. If set of hyperparameters have 
been obtained using gp_mc or gp_ia LOO-posterior can be approximated using importance 



sampling (Gelfand et al. 1992 Vehtari and Lampinen, 2002), which is also implemented in 



gp_loopred. 

To reduce computation time, in /c-fold-CV, 1/k part of the training data is left out, 
and then the predictive distribution is computed with a model that is fitted to all of the 
observations except that part. Since the fe-fold-CV predictive densities are based on smaller 
training data sets than the full data set, the estimate is slightly biased. In GPstuff first 



order bias correction proposed by Burman (1989) is used. GPstuff provides gp_kfcv, which 



computes fc-fold-CV and bias-corrected fc-fold-CV with log-score and root mean squared 
error (RMSE). The function gp_kf cv provides also basic variance estimates for the predictive 
performance estimates. Variance is computed from the estimates for each fold (see, e.g., 



Dietterich 1998). See (Vehtari and Lampinen 2002) for more details on estimating the 
uncertainty in performance estimates. 



12.3 DIC 

Deviance information criterion (DIC) is another very popular model selection criterion 



( Spiegelhalter et al. 2002). DIC is not fully Bayesian approach as it estimates the pre- 



dictive performance if the predictions were made using point-estimate (plug-in estimate) for 
the unknowns, instead of using posterior predictive distribution. Additionally DIC is defined 
only for regular models and thus fails for singular models. DIC is included in GPstuff to 
allow comparisons and better alternative WAIC is described in the next section. 
With parametric models without any hierarchy DIC is usually written as 



PeS = E ei v[D(y,9)]-D(y,E elv [e]) 
DIC = E 0p [£>(y,0)]+p effj 



(82) 
(83) 



where p e s is the effective number of parameters and D = — 21og(p(y \9)) is the deviance. 



Since our models are hierarchical we need to decide the parameters on focus (see Spiegelhalter 
et al. 2002 for discussion on this). The parameters on the focus are those over which the 



expectations are taken when evaluating the effective number of parameters and DIC. In the 
above equations, the focus is in the parameters and in the case of a hierarchical GP model 
of GPstuff the latent variables would be integrated out before evaluating DIC. If we have a 
MAP estimate for the parameters, we may be interested to evaluate DIC statistics with the 
focus on the latent variables. In this case the above formulation would be 



p D {9) = E w [ J D(y,f)] - £>(y,E w [f]) 
DIC = E f |^[L>(y,f)]+^(0). 



(84) 
(85) 



Here the effective number of parameters is denoted differently with Pd{9) since now we are 
approximating the effective number of parameters in f conditionally on 9, which is different 
from the p e g. Pd(9) is a function of the parameters and it measures to what extent the 
prior correlations are preserved in the posterior of the latent variables given 9. For non- 
informative data pd{9) = and the posterior is the same as the prior. The greater pd{9) is 
the more the model is fitted to the data and large values compared to n indicate potential 
overfit. Also, large Pd{9) indicates that we cannot assume that the conditional posterior 
approaches normality by central limit theorem. Thus, Pd(9) can be used for assessing the 
goodness of the Laplace or EP approximation for the conditional posterior of the latent 



variables as discussed by Rue et al. (2009) and Vanhatalo et al. (2010). The third option is 



to evaluate DIC with focus on all the variables, [f, 9]. In this case the expectations are over 

P (f,e\v). 



12.4 WAIC 



Watanabe (2009 2010a[b) presented widely applicable information criterion (WAIC) and 



gave a formal proof of its properties as an estimate for the predictive performance of posterior 
predictive distributions for both regular and singular models. A criterion of similar form 



was independently proposed by Richardson (2002) as a version of DIC, but without formal 
justification. 

Other information criteria are based on Fisher's asymptotic theory assuming a regular 
model for which the likelihood or the posterior converges to a single point and MLE, MAP, 
and plug-in estimates are asymptotically equivalent. With singular models the set of true 
parameters consists of more than one point, the Fisher information matrix is not positive 
definite, plug-in estimates are not representative of the posterior and the distribution of the 
deviance does not converge to a xt distribution. 

Watanabe shows that the Bayesian generalization utility can be estimated by a criterion 



WAIC G = BU t - 2(BU t - GU t ), 
where BUt is Bayes training utility 



1 n 

BU t = -y j \ogp{y i \D,M k ) 



(86) 



(87) 



i=l 

and GU g is Gibbs generalization utility 

GU g = J Pt (y) J p(d\D,M k )logp(y\9,M k )dedy 

WAIC can also be given as a functional variance form 

WAICy = BU t ~ V/n, 

where the functional variance 



(89) 



\D,M k 



(iogp( yi \xi,e,M k )Y 



i=i 



{ E e\D,M k [ l °SP(yi\xi,0,M k )})' 



(90) 



describes the fluctuation of the posterior distribution. 

WAIC is asymptotically equal to the true logarithmic utility in both regular and singular 



statistical models and the error in a finite case is o(l/n). Watanabe (2010b) shows also that 



the WAIC estimate is asymptotically equal to the Bayesian cross-validation estimate (section 



12.2). WAICg and WAICy are asymptotically equal, but the series expansion of WAICy 



has closer resemblance to the series expansion of the logarithmic leave-one-out utility. 



12.5 Model assessment demos 

The model assessment methods are demonstrated with the functions demo_modelassesmentl 
and demo_modelassesment2. The former compares the sparse GP approximations to the 
full GP with regression data and the latter compares the logit and probit likelihoods in GP 
classification. 

Assume that we have built our regression model with a Gaussian noise and used opti- 
mization method to find the MAP estimate for the parameters. We evaluate the effective 
number of latent variables, DIC and WAIC 



p_eff_latent = gp_peff(gp, x, y) ; 

[DIC_latent, p_ef f _latent2] = gp_dic(gp, x, y, 'focus', 'latent'); 
WAIC = gp_waic(gp,x,y) ; 



where p e g is evaluated with two different approximations. Since we have the MAP estimate 
for the parameters the focus is on the latent variables. In this case we can also use gp_pef f 
which returns the effective number of parameters approximated as 

p D {9) « n - tr(K"A(K^ f +cj- 2 1)- 1 ) (91) 



( Spiegelhalter et al. 2002). When the focus is on the latent variables, the function gp_dic 



evaluates the DIC statistics and the effective number of parameters as described by the 



equations ( 84 ) and ( 85 ) . 



The £;-fold-CV expected utility estimate can be evaluated as follows: 
cvres = gp_kfcv(gp, x, y) ; 

The gp_kfcv takes the ready made model structure gp and the training data x and y. 
The function divides the data into k groups, conducts inference separately for each of the 
training groups and evaluates the expected utilities with the test groups. Since no optional 
parameters are given the inference is conducted using MAP estimate for the parameters. 
The default division of the data is into 10 groups. The expected utilities and their variance 
estimates are stored in the structure cvres. gp_kfcv returns also other statistics if more 
information is needed and the function can be used to save the results automatically. 

Assume now that we have a record structure from gp_mc function with Markov chain 
samples of the parameters stored in it. In this case, we have two options how to evaluate 
the DIC statistics. We can set the focus on the parameters or all the parameters (that 
is parameters and latent variables). The two versions of DIC and the effective number of 
parameters and WAIC are evaluated as follows: 

r gP = gP-mc(gp, x, y, opt); 

[DIC, p_eff] = gp_dic(rgp, x, y, 'focus', 'param'); 
[DIC2, p_eff2] = gp_dic(rgp, x, y, 'focus', 'all'); 
WAIC = gp_waic(gp,x,y) ; 

Here the first line performs the MCMC sampling with options opt. The next two lines 
evaluate the DIC statistics and the last line evaluates WAIC. With Markov chain samples, 
we cannot use the gp_peff function to evaluate Pd{6) since that is a special function for 
models with fixed parameters. 

The functions gp_pef f , gp_dic and gp_kf cv work similarly for non-Gaussian likelihoods 
as for a Gaussian one. The only difference is that the integration over the latent variables 
is done approximately. The way the latent variables are treated is defined in the field 
latent_method of the GP structure and this is initialized when constructing the model as 



discussed in the section 6.2 The effective number of parameters returned by gp_pef f is 
evaluated as in the equation (91) with the modification that <7" 2 I iS replaced by W in the 



case of Laplace approximation and S 1 in the case of EP. 



13. Adding new features in the toolbox 



As described in the previous sections, GPstuff is build modularly so that the full model 
is constructed by combining separate building blocks. Each building block, such as the 
covariance function or likelihood, is written in a m-file which contains all the functions 
and parameters specific to it. This construction makes it easier to add new model blocks 
in the package. For example, new covariance functions can be written by copying one of 
the existing functions to a new file and modifying all the subfunctions and parameters 
according to the new covariance function. With likelihoods it should be remembered that 
GPstuff assumes currently that they factorize as described in the equation ([I]). All the 
inference methods should work for log-concave likelihoods. However, likelihood functions 
that are not log-concave may lead to problems with Laplace approximation and EP since 
the conditional posterior of the latent varia bles may be multimodal and the diagonal matrices 



W and 5] may contain negative elements (Vanhatalo et al. 2009). For example, Student-i 



observation model leads to a likelihood, which is not log-concave, and requires some specific 
modifications to the Laplace approximation and EP algorithm (see Jylanki et al. 2011). 
Using MCMC should, however, be straightforward with non-log-concave likelihoods as well. 
See Appendix [F] for technical details of lik, gpcf and prior functions, and argument 
interface for optimisation and sampling functions. 

The package is not as modular with respect to the computational techniques as to the 
model blocks. Usually, each inference method requires specific summaries from the model 
blocks. For example, the EP algorithm requires that each likelihood has a function to 
evaluate integrals such as f fip(yi\fi, (j>)N{fi\n, cr 2 )dfi whereas the Laplace approximation 
requires the Hessian of the log likelihood, Vf logp(y | f , <p). For this reason adding, for exam- 
ple, variational approximation would require modifications also in the likelihood functions. 



14. Discussion 



Broadly thinking, GPs have been researched intensively for decades and many disciplines 
have contributed to their study. They have not, necessarily, been called GP but, for ex- 
ample, spatial statistics, signal processing and filtering literature have their own naming 
conventions. The point of view taken in GPstuff comes mainly from the machine learning 
literature where the subject has flourished to a hot topic since late 1990's. Our aim in the 
GPstuff package has been to collect existing and create new practical tools for analyzing 
GP models. We acknowledge that most of the basic models and algorithms have been pro- 
posed by others. However, the implementation in GPstuff is unique and contains several 
practical solutions for computational problems that are not published elsewhere. Our own 



contribution on GP theory that is included in GPstuff is described mainly in (Vanhatalo 



and Vehtari 


2007 


2008 


2010 



2011) 



Our intention has been to create a toolbox with which useful, interpretable results can be 
produced in a sensible time. GPstuff provides approximations with varying level of accuracy. 
The approximation to be used needs to be chosen so that it approximates well enough 
the essential aspects in the model. Thus, the choice is always data, model, and problem 
dependent. For example, MCMC methods are often praised for their asymptotic properties 
and seemingly easy implementation. Algorithms, such as Metropolis-Hastings, are easy to 



implement for most models. The problem, however, is that as a data set grows they may 
not give reliable results in a finite time. With GP models this problem is faced severely. For 
example, currently problems with over 10 000 data points would be impractical to analyse 
with a standard office PC using MCMC since the convergence rate and mixing of the sample 
chain would be too slow. For this reason it is important to have also faster, but perhaps 
less accurate, methods. The choice of the method is then a compromise between time and 
accuracy The inference is the fastest when using MAP estimate for the parameters and a 
Gaussian function for the conditional posterior. With a Gaussian likelihood, the Gaussian 
conditional distribution is exact and the only source of imprecision is the point estimate 
for the parameters. If the likelihood is other than Gaussian, the conditional distribution 
is an approximation, whose quality depends on how close to Gaussian the real conditional 
posterior is, and how well the mean and covariance are approximated. The form of the real 
posterior depends on many things for which reason the Gaussian approximation has to be 
assessed independently for every data. 

One of our aims with GPstuff has been to provide an easy way to detect model mis- 
specifications. This requires the model assessment tools discussed in the section 12 and an 
easy way to test alternative model structures. The model misfit most often relates either to 
the likelihood or GP prior. For example, if the data contained outliers or observations with 
clearly higher variance than what the Gaussian or Poisson observation model predicts, the 
posterior of the latent function would be highly compromised. For this reason, robust alter- 
natives for traditional models, such as Student-i or negative binomial distribution should be 
easily testable. Even though the GP prior is very flexible and only the mean and covariance 
need to be fixed, it still contains rather heavy assumptions. For example, GP associated 
with the squared exponential covariance function is indefinitely mean square differentiable. 
This is a very strong assumption on the smoothness of the latent function. In fact it is rather 
peculiar how little attention other covariance functions have gained in machine learning lit- 
erature. One of the reasons may be that often machine learning problems take place in high 
dimensional input spaces where data are enforced to lie sparsely and for which reason the 
possible solutions are smooth. However, the covariance function sometimes does influence 
the results 



(see e.g. 



Vanhatalo and Vehtari, 2008, Vanhatalo et al. 2010 1. 



In statistics, inference in the parameters is a natural concern but in machine learning 
literature they are left in less attention. An indicator of this is the usual approach to 
maximize the marginal likelihood which implies uniform prior for the parameters. GPstuff 
provides an easy way to define priors explicitly so that people would really use them (this 
principle is also in line with reasoning by Gelman 2006). We want to stress a few reasons 
for explicitly defining a prior. In spatial statistics, it is well known that the length-scale 
and magnitude are underidentifiable and the proportion a 2 /I is more important to the 



predictive performance than their individual values (Diggle et al. 1998 Zhang 2004 Diggle 



and Ribeiro 2007). These results are shown for Matern class of covariance functions but 



according to our experiments they seem to apply for Wendland's piecewise polynomials as 
well. With them, the property can be taken advantage of since by giving more weight to 
short length-scales one favors sparser covariance matrices that are faster in computations. 
Other advantage is that priors make the inference problem easier by narrowing the posterior 
and making the parameters more identifiable. This is useful especially for MCMC methods 
but optimization and other integration approximations gain from the priors as well. These 



two reasons are rather practical. More fundamental reason is that in Bayesian statistics 
leaving prior undefined (meaning uniform prior) is a prior statement as well, and sometimes 
it may be really awkward. For example, uniform prior works very badly for the parameters 
of the Student-t distribution. Thus, it is better to spend some time thinking what the prior 
actually says than blindly use the uniform. 



Much could be done to improve the inference algorithms in GPstuff. For example, the 
shape of the conditional posterior of a latent variable could be estimated more accurately 



with techniques described by, for example, Rue et al. 



Cseke and Heskes (2010) (see also Tierney and Kadane 



(2009), Paquet et al. (2009), and 



1986). Improvements compromise 



the computational speed but are essential for more reliable results, for which reason these 
techniques provide an important future extension for the toolbox. There are also other 
analytic approximations than the Laplace approximation or EP proposed in the literature. 



2000 



Most of these are based on some kind of variational approximation ( Gibbs and Mackay 
Csato and Opper] |2002 Tipping and Lawrence 2005 Kuss 2006| Opper and Arcl 
2009). Laplace approximation and EP were chosen to GPstuff for a few reasons. They 
both are, in theory, straightforward to implement for any factorizable likelihood (although 
sometimes there are practical problems with the implementation). Laplace approximation 
is also fast and EP is shown to work well in many problems. Here, we have considered 
parametric observation models but non-parametric versions are possible as well. Examples 



of possible extensions to that direction are provided by Snelson et al. (2004) and Tokdar 



et al. (2010). 



GPstuff is an ongoing project and the package will be updated whenever new function- 



alities are ready. The latest version of the toolbox is available from http : //bees . aalto . 



f i/en/research/bayes/gpst uff/j 
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Appendix A. Comparison of features in GPstuff, GPML and FBM 
Appendix B. Covariance functions 

In this section we summarize all the covariance functions in the GPstuff package. 
Squared exponential covariance function (gpcf _sexp) 

Probably the most widely-used covariance function is the squared exponential (SE) 

fc(x„ x,) = aL P exp g fo fc - g * fc)2 ) . (92) 

The length scale 1^ governs the correlation scale in input dimension k and the magnitude 
CT sexp the overall variability of the process. A squared exponential covariance function leads 
to very smooth GPs that are infinitely times mean square differentiable. 

Exponential covariance function (gpcf _exp) 

Exponential covariance function is defined as 



/c(xj,Xj) — <7g xp exp | , ^ , 

\ k=l 



^2 Kfc ^fc) 2 1 . (93) 



k 

The parameters Ik and <7g Xp have similar role as with the SE covariance function. The expo- 
nential covariance function leads to very rough GPs that are not mean square differentiable. 

Matern class of covariance functions (gpcf _maternXX) 

The Matern class of covariance functions is given by 

T(i/) 



k v {x i ,x j ) = <4s FnK [y/2vr) K v [V2ur) , (94) 



where r = ( Yl^—l ^' ,fc ^ . The parameter v governs the smoothness of the process, 



and K y is a modified Bessel function (Abramowitz and Stegun 1970 sec. 9.6). The Matern 
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Table 1: The comparison of features in GPstuff (v4.1), GPML (v3.2) and FBM 
(2004-11-10) toolboxes. In case of model blocks the notation x means that it can 
be inferred with any inference method (EP, LA (Laplace), MCMC and in case of 
GPML also with VB). In case of sparse approximations, inference methods and 
model assessment methods x means that the method is available for all model 
blocks. 



covariance functions can be represent in a simpler form when v is a half integer. The Matern 
covariance functions with v = 3/2 (gpcf _matern32) and v = 5/2 (gpcf _matern52) are: 



^=3/2( x i> x j) = <?m f 1 + ex P (~^ ? 

*V=5/2(xi,Xj) = a™ ^1 + \/5r + exp 



(95) 
(96) 



Neural network covariance function (gpcf _neuralnetwork) 

A neural network with suitable transfer function and prior distribution converges to a GP 



as the number of hidden units in the network approaches to infinity (Neal 1996 Williams 



1996 Rasmussen and Williams 2006). A nonstationary neural network covariance function 



is 



k (x^ ) x j 



2x7 £x,- 



sin 



7T 



(l + 2 X T£x i )(l + 2xj£x i 



(97) 



where x = (1, x\, . . . , Xd) T is an input vector augmented with 1. £ = diag(cig, af, . . . , u^) is 
a diagonal weight prior, where <Tq is a variance for bias parameter controlling the functions 
offset from the origin. The variances for weight parameters are af, . . . ,a^, and with small 
values for weights, the neural network covariance function produces smooth and rigid looking 
functions. The larger values for the weight variances produces more flexible solutions. 

Constant covariance function (gpcf _constant) 

Perhaps the simplest covariance function is the constant covariance function 



k ( x i ? Xj 



(98) 



with variance parameter a 2 . This function can be used to implement the constant term in 



the dot-product covariance function ( Rasmussen and Williams 2006 ) reviewed below 



Linear covariance function (gpcf _linear) 

The linear covariance function is 



k (x j , Xj ) — x j Sxj 



(99) 



where the diagonal matrix £ = diag(<r^, . . . , alfr) contains the prior variances of the linear 
model coefficients. Combining this with the constant function above we can form covariance 



function Rasmussen and Williams (2006), which calls a dot-product covariance function 



k (x^ , xj 



a 2 + x^Sxj. 



(100) 



Piecewise polynomial functions (gpcf _ppcsX) 

The piecewise polynomial functions are the only compactly supported covariance functions 



(see section 



10) in GPstuff. There are four of them with the following forms 

k pp0 (xi,Xj) =a 2 (l - r) 3 + 

kppi(xi, xj) =a 2 (l - r) J + +1 ((j + l)r + 1) 



a 



/c pp2 ( Xi , Xj ) = — (1 - r) J + +2 ((j 2 + 4j + 3)r 2 + (3j + 6)r + 3) 



a 



fcppaCxi, Xj-) = — (1 - r)^((j d + 9j 2 + 23j + 15)r d + 
(6j 2 + 36j + 45)r 2 + (15j + 45)r + 15) 



(101) 
(102) 

(103) 
(104) 



where j = [d/2\ + q + 1. These functions correspond to processes that are 2q times mean 



square differentiable at the zero and positive definite up to the dimension d (Wendland 



2005). The covariance functions are named gpcf_ppcs0, gpcf_ppcsl, gpcf_ppcs2, and 



gpcf _ppcs3. 

Rational quadratic covariance function (gpcf_rq) 



The rational quadratic (RQ) covariance function (Rasmussen and Williams 2006) 

fc R Q(Xj,Xj) = 




(105) 



can be seen as a scale mixture of squared exponential covariance functions with different 
length-scales. The smaller the parameter a > is the more diffusive the length-scales of the 
mixing components are. The parameter Ik > characterizes the typical length-scale of the 
individual components in input dimension k. 



Periodic covariance function (gpcf _periodic) 

Many real world systems exhibit periodic phenomena, which can be modelled with a periodic 
covariance function. One possible construction (Rasmussen and Williams 2006) is 

/ d 



k(*i,Xj) = exp 1-^2 



2sm 2 (ir(x ijk - x jtk )/j) 



k=l 



f 2 
'k 



(106) 



where the parameter 7 controls the inverse length of the periodicity and l k the smoothness 
of the process in dimension k. 



Product covariance function (gpcf _product) 

A product of two or more covariance functions, &i(x, x') • /^(x, x')..., is a valid covariance 
function as well. Combining covariance functions in a product form can be done with 
gpcf _prod, for which the user can freely specify the covariance functions to be multiplied 
with each other from the collection of covariance functions implemented in GPstuff. 



Categorical covariance function (gpcf _cat) 

Categorical covariance function gpcf _cat returns correlation 1 if input values are equal and 
otherwise. 

n I 1 if X; — X; = , 
fe(x i ,x,)= 1 . J (107) 

I otherwise 

Categorical covariance function can be combined with other covariance functions using 
gpcf _prod, for example, to produce hierarchical models. 

Appendix C. Observation models 

Here, we summarize all the observation models in GPstuff. Most of them are implemented in 
files lik_* which reminds that at the inference step they are considered likelihood functions. 

Gaussian (lik_gaussian) 

The i.i.d Gaussian noise with variance a 2 is 

y|f,<r 2 ~iV(f,<7 2 I). (108) 
Student-i (lik_t, lik_gaussiansmt) 

The Student-i observation model (implemented in lik_t) is 

jl r((i/ + i)/2) / ( yi - fi) 2 Y (u+1)/2 , \ 

y f,v, at ~]J -±— /L 1 1+ ^ , 109 

where v is the degrees of freedom and a the scale parameter. The scale mixture version of 
the Student-t distribution is implemented in lik_gaussiansmt and it is parametrized as 

Vilfi^Ui^NifuaUi) (110) 
Ui-Inv-x 2 ^^ 2 ), (111) 



where each observation has its own noise variance alii (Neal 1997 Gelman et al. 2004[ ) 



Logit (lik_logit) 

The logit transformation gives the probability for yi of being 1 or —1 as 

ploMfi) = i + eM- mfl y (112) 

Probit (lik_probit) 

The probit transformation gives the probability for yi of being 1 or — 1 as 

1 N{z\Q,\)dz. (113) 

-oo 



Poisson (lik_poisson) 

The Poisson observation model with expected number of cases e is 

n 

y | f,e ~ J|Poisson(yj| exp(/ i )e i ). (114) 
Negative-Binomial (lik_negbin) 

The negative-binomial is a robustified version of the Poisson distribution. It is parametrized 

where ^ = eexp(/j), r is the dispersion parameter governing the variance, ej is the expected 
number of cases and j/j is positive integer telling the observed count. 

Binomial (lik_binomial) 

The binomial observation model with the probability of success pi = exp(/j)/(l + exp(/j)) 
is 

N , 

y| f .-~n ra r^pT(i-ft) ( *'- w) - ( 116 ) 

Here, Zi denotes the number of trials and j/j is the number of successes. 

Weibull (lik_weibull) 

The Weibull likelihood is defined as 

n 

L = H r 1 -** exp ((1 - + U " *i)(r - 1) logfa) - exp(/(x,))y[) , (117) 

i=i 

where z is a vector of censoring indicators with Z{ = for uncensored event and = 1 for 
right censored event for observation i and r is the shape parameter. Here we present only 
the likelihood function because we don't have observation model for the censoring. 

Log-Gaussian (lik_loggaussian) 

The Log-Gaussian likelihood is defined as 

L = n^ 2 )^ 1 -^)/ 2 ^ exp (-^(1 - *i)(log(yi) - /(x,)) 2 ) (118) 



(7 

where a is the scale parameter. 



Log-logistic (lik_loglogistic) 

The log-logistic likelihood is defined as 



ry \ (, t I y 



^nufej ( 1+ (™JJ (119) 

Cox proportional hazard model (lik_coxph) 

The likelihood contribution for the possibly right censored ith observation (j/j, Si) is assumed 
to be 

k = hiiyi)^^ exp (- J" hi(t)dtj . (120) 

Using the piecewise log-constant assumption for the hazard rate function, the contribution 
of the observation i for the likelihood results in 

I ^ \ 

k = [A fc exp(??i)] (1 il) exp I -[(yj - s fc _!)A fe + - s 9 _i)A 9 ]exp(?7j) I , (121) 

where y { G (s fe _i,s fe ] 

Input-dependent noise (lik_inputdependentnoise) 

The input-dependent noise observation model is defined as 

n 

y | f W f W a 2 ~ [J NiVilfP," 2 *Mf? >)), (122) 
i=i 

with latent function defining the mean and defining the variance. 

Input-dependent Weibull (lik_inputdependentweibull) 

The input-dependent Weibull observation model is defined as 

n 

y | f « f ( 2 ), z ~ ex P(/f exp ((1 - ^if (123) 



i=l 



+ (1 - Zi)(exp(/f } ) - 1) logfo) exp(/( 1) )y i cxp(/ ^ )) 

where z is a vector of censoring indicators with Zi = for uncensored event and = 1 for 
right censored event for observation i. 

Zero-inflated Negative-Binomial (lik_zinegbin) 

The Zero-Inflated Negative-Binomial observation model is defined as 

where jjn = e,exp(/(xj)) and r is the dispersion parameter governing the variance. 



Appendix D. Priors 

This appendix lists all the priors implemented in the GPstuff package. 



Gaussian prior (prior_gaussian) 

The Gaussian distribution is parametrized as 

1 / 1 



^^^^{-^-"^ (125) 



where \x is a location parameter and a 2 is a scale parameter. 

Log-Gaussian prior (prior_loggaussian) 

The log-Gaussian distribution is parametrized as 



pW = ^l=exp(-^(log W -^) (126) 



where \x is a location parameter and a 2 is a scale parameter. 

Laplace prior (prior_laplace) 

The Laplace distribution is parametrized as 



p(*) .' expf - ) (J27) 



2a r V a 

where \x is a location parameter and a > is a scale parameter 

Student-i prior (prior_t) 

The Student-t distribution is parametrized as 



r((i/ + i)/2) / | (e - tfyw* 

T(y/2)\fvKcr 2 \ vo 2 



where /x is a location parameter, cr is a scale parameter and v > is the degrees of freedom. 

Square root Student-t prior (prior_sqrtt) 

The square root Student-t distribution is parametrized as 



p(° 1/2 ) = :rl * ) (129) 



r((^ + i)/2) A + (g-/,) 2 y ( - +1)/2 

where /i is a location parameter, a 2 is a scale parameter and v > is the degrees of freedom. 



Scaled inverse-^ 2 prior (prior_sinvchi2) 

The scaled inverse-x 2 distribution is parametrized as 

= ^/^l {s 2y/2 e -H2+l) e -usy m (13Q) 

1 [y 1 2) 

where s 2 is a scale parameter and v > is the degrees of freedom parameter. 

Gamma prior (prior_gamma) 

The gamma distribution is parametrized as 

vie) = J^-e^e-? 6 (i3i) 

where a > is a shape parameter and f3 > is an inverse scale parameter. 

Inverse-gamma prior (prior_invgamma) 

The inverse-gamma distribution is parametrized as 

p{9) = ^(?-(«+i) e -^ (132) 
r(a) 

where a > is a shape parameter and /3 > is a scale parameter. 

Uniform prior (prior_unif) 

The uniform prior is parametrized as 

p{0) oc 1. (133) 

Square root uniform prior (prior_sqrtunif ) 

The square root uniform prior is parametrized as 

pie 1 / 2 ) oc 1. (134) 

Log-uniform prior (prior_logunif ) 

The log-uniform prior is parametrized as 

p(log(0)) oc 1. (135) 

Log- log- uniform prior (prior_loglogunif ) 

The log-log-uniform prior is parametrized as 

p(log(log(0))) oc 1. (136) 



Appendix E. Transformation of hyperparameters 

The inference on the parameters of covariance functions is conducted mainly transformed 
space. Most of ten used transformation is log-transformation, which has the advantage that 
the parameter space (0, oo) is transformed into (—00,00). The change of parametrization 
has to be taken into account in the evaluation of the probability densities of the model. If 
parameter 9 with probability density p$(9) is transformed into the parameter w = f{9) with 
equal number of components, the probability density of w is given by 



(w) = \J\ Pe (r 1 (w)), 



(137) 



where J is the Jacobian of the transformation 9 = f^ 1 (w). The parameter transformations 
are discussed shortly, for example, in Gelman et al. ( 2004| ) [p. 24]. 

Due to the log transformation w = log(0) transformation the probability densities po(9) 
are changed to the densities 



Pw{w) = \J\pe(exp(w)) = \J\pe(9), 



(138) 



where the Jacobian is J 



dexp(w) 
dw 



exp(w) = 9. Now, given Gaussian observation model 

(139) 



(see Section 3.1) the posterior of w can be written as 

p w (w\V)^p{y\X,9)p(9\j)9, 

which leads to energy function 

E(w) = - logp(y|X, 9) - logp(0| 7 ) - log(|0|). 
= £(0)-log(0), 

where the absolute value signs are not shown explicitly around 9 because it is strictly positive. 
Thus, the log transformation just adds term — \og9 in the energy function. 

The inference on w requires also the gradients of an energy function E(w). These can 
be obtained easily with the chain rule 



dEjw) 
dw 



dE{9) aiog(|J|) 



39 
dE{9) 
39 



09 

j_din 

|J| 39 



39 
dw 



J. 



(140) 



Here we have used the fact that the last term, derivative of 9 with respect to w, is the same 
as the Jacobian J = = ^ w . Now in the case of log transformation the Jacobian can be 
replaced by 9 and the gradient is gotten an easy expression 



3E{w) 3E{9) 



dw 



39 



1. 



(141) 



Appendix F. Developer appendix 



This section provides additional description of lik, gpcf and prior functions, and argument 
interface for optimisation and sampling functions. 

F.l lik functions 

New likelihoods can be added by copying and modifying one of the existing lik functions. 
We use lik_negbin as an example of log-concave likelihood and lik_t as an example of 
non-log-concave likelihood. Note that adding a new non-log-concave likelihood is more 
challenging as the posterior of the latent values may be multimodal, which makes it more 



difficult to implement stable Laplace and EP methods (see, e.g., Vanhatalo et al. 2009 



Jylanki et al. 2011) 



F.l.l lik_negbin 

inputParser is used to make argument checks and set some default values. If the new 
likelihood does not have parameters, see for example, lik_poisson. 

ip=inputParser ; 

ip.FunctionName = 'LIK_NEGBIN' ; 

ip. addOptional( 'lik' , [] , Oisstruct) ; 

ip. addParamValue ( 'disper' , 10, @(x) isscalar(x) && x>0) ; 

ip. addParamValue ( 'disper_prior ' ,prior_logunif () , @(x) isstruct(x) || isempty(x)) ; 
ip . parse (varargin{ : }) ; 
lik=ip. Results . lik; 

Function handles to the subfunctions provide object-oriented behaviour (at time when 
GPstuff project was started, real classes in Matlab were too inefficient and Octave still lacks 
proper support). If some of the subfunctions have not been implemented, corresponding 
function handles should not be defined. 

if init 

% Set the function handles to the subfunctions 

lik.fh.pak = @lik_negbin_pak; 

lik.fh.unpak = @lik_negbin_unpak; 

lik.fh.lp = @lik_negbin_lp; 

lik.fh.lpg = @lik_negbin_lpg; 

lik.fh.ll = @lik_negbin_ll; 

lik.fh.llg = @lik_negbin_llg; 

lik.fh.llg2 = @lik_negbin_llg2; 

lik.fh.llg3 = @lik_negbin_llg3; 

lik.fh.tiltedMoments = @lik_negbin_tiltedMoments ; 
lik.fh. siteDeriv = @lik_negbin_siteDeriv; 
lik.fh.predy = @lik_negbin_predy ; 
lik.fh.predprcty = @lik_negbin_predprcty ; 
lik.fh. invlink = @lik_negbin_invlink; 



lik.fh.recappend = @lik_negbin_recappend; 
end 

end 

lik_negbin_pak and lik_negbin_unpak functions are used to support generic optimiza- 
tion and sampling functions, which assume that the parameters are presented as a vector. 
Parameters which have empty prior ([]) are ignored. These subfunctions are mandatory 
(even if there are no parameters). 

lik_negbin_lp and lik_negbin_lpg compute the log prior density of the parameters 
and its gradient with respect to parameters. Parameters which have empty prior ([]) are 
ignored. These subfunctions are needed if there are likelihood parameters. 

lik_negbin_ll and lik_negbin_llg compute the log likelihood and its gradient with 
respect to parameters and latents. Parameters which have empty prior ([]) are ignored. 
These subfunctions are mandatory. 

Iik_negbin_llg2 computes second gradients of the log likelihood. This subfunction is 
needed when using Laplace approximation or EP for inference with non-Gaussian likelihoods. 

Iik_negbin_llg3 computes third gradients of the log likelihood. This subfunction is 
needed when using Laplace approximation for inference with non-Gaussian likelihoods. 

lik_negbin_tiltedMoments returns the marginal moments for the EP algorithm. This 
subfunction is needed when using EP for inference with non-Gaussian likelihoods. 

lik_negbin_siteDeriv evaluates the expectation of the gradient of the log likelihood 
term with respect to the likelihood parameters for EP. This subfunction is needed when 
using EP for inference with non-Gaussian likelihoods and there are likelihood parameters. 

lik_negbin_predy returns the predictive mean, variance and density of y. This subfunc- 
tion is needed when computing posterior predictive distributions for future observations. 

lik_negbin_predprcty returns the percentiles of predictive density of y. This subfunc- 
tion is needed when using function gp_predprcty. 

lik_negbin_init_negbin_norm is a private function for lik_negbin. It returns function 
handle to a function evaluating Negative-Binomial * Gaussian which is used for evaluating 
(likelihood * cavity) or (likelihood * posterior). This subfunction is needed by subfunctions 
lik_negbin_tiltedMoments, lik_negbin_siteDeriv and lik_negbin_predy. Note that 
this is not needed for those likelihoods for which integral over the likelihood times Gaussian 
is computed using other than quadrature integration. 

lik_negbin_invlink returns values of inverse link function. This subfunction is needed 
when using function gp_predprctmu. 

lik_negbin_recappend This subfunction is needed when using MCMC sampling (gp_mc). 

F.1.2 lik_t 

lik_t includes some additional subfunctions useful for non-log-concave likelihoods. 

lik_t_tiltedMoments2 returns the marginal moments for the EP algorithm. This sub- 
function is needed when using robust-EP for inference with non-Gaussian likelihoods. 

lik_t_siteDeriv2 evaluates the expectation of the gradient of the log likelihood term 
with respect to the likelihood parameters for EP. This subfunction is needed when using 
robust-EP for inference with non-Gaussian likelihoods and there are likelihood parameters. 



lik_t_optimizef function to optimize the latent variables with EM algorithm. This 
subfunction is needed when using likelihood specific optimization method for mode finding 
in the Laplace algorithm. 

F.2 gpcf functions 

New covariance functions can be added by copying and modifying one of the existing gpcf 
functions. We use gpcf _sexp as an example. 

F.2.1 gpcf_sexp 

inputParser is used to make argument checks and set some default values. If the new 
covariance function does not have parameters, corresponding lines can be removed (see, 
e.g., lik_cat). 

ip=inputParser ; 

ip.FunctionName = 'GPCF_SEXP'; 

ip. addOptional( 'gpcf ' , [] , Oisstruct) ; 

ip. addParamValue( 'magnSigma2' ,0 . 1 , @(x) isscalar(x) kk x>0) ; 
ip.addParamValue( 'lengthScale' , 1, @(x) isvector(x) kk all(x>0)); 
ip. addParamValue( 'metric' , [] , Oisstruct) ; 
ip. addParamValue( 'magnSigma2_prior ' , prior_logunif () , ... 

@(x) isstruct(x) || isempty(x)) ; 
ip. addParamValue( 'lengthScale_prior ' ,prior_t () , . . . 

@(x) isstruct(x) || isempty(x)) ; 
ip. addParamValue ( 'selectedVariables ' , [] , @(x) isempty(x) || ... 

(isvector(x) kk all(x>0))); 
ip . parse (varargin{ : }) ; 
gpcf =ip . Results . gpcf ; 

Optional 'metric' can be used to replace default simple euclidean metric. Currently it is 
possible to use delta distance or group covariates to have common lengthScale parameters. 
Other potential metrics could be, for example, Manhattan or distance matrix based metrics. 
The metric function is called only if it has been explicitly set. This adds some additional 
code branches to the code, but avoids extra overhead in computation when using simple 
euclidean metric case. 

Function handles to the subfunctions provide object-oriented behaviour (at time when 
GPstuff project was started, real classes in Matlab were too inefficient and Octave still lacks 
proper support). If some of the subfunctions have not been implemented, corresponding 
function handles should not be defined. 

if init 

°/ Set the function handles to the subfunctions 
gpcf.fh.pak = Ogpcf _sexp_pak; 
gpcf . f h . unpak = Ogpcf _sexp_unpak; 
gpcf.fh.lp = Ogpcf _sexp_lp; 



gpcf.fh.lpg= Ogpcf _sexp_lpg; 
gpcf.fh.cfg = Ogpcf _sexp_cf g; 
gpcf.fh.cfdg = Ogpcf _sexp_cfdg; 
gpcf . fh. cfdg2 = Ogpcf _sexp_cfdg2; 
gpcf . f h . ginput = Ogpcf _sexp_ginput ; 
gpcf . fh.ginput2 = Ogpcf _sexp_ginput2; 
gpcf . fh.ginput3 = Ogpcf _sexp_ginput3; 
gpcf . fh.ginput4 = ©gpcf _sexp_ginput4; 
gpcf.fh.cov = Ogpcf _sexp_cov; 
gpcf . fh.trcov = Ogpcf _sexp_trcov; 
gpcf . fh.trvar = Ogpcf _sexp_trvar; 
gpcf . fh.recappend = Ogpcf _sexp_recappend; 
end 



gpcf _sexp_pak and gpcf _sexp_unpak functions are used to support generic optimization 
and sampling functions, which assume that parameters are presented as a vector. Parameters 
which have empty prior ([]) are ignored. These subfunctions are mandatory (even if there 
are no parameters). 

gpcf _sexp_lp and gpcf _sexp_lpg compute the log prior density of the parameters and 
its gradient with respect to parameters. Parameters which have empty prior ([]) are ignored. 
These subfunctions are mandatory. 

gpcf _sexp_cov evaluates covariance matrix between two input vectors. This is a manda- 
tory subfunction used for example in prediction and energy computations. 

gpcf _sexp_trcov evaluates training covariance matrix of inputs. This is a manda- 
tory subfunction used for example in prediction and energy computations. If available, 
gpcf _sexp_trcov uses faster C-implementation trcov for covariance computation (see lin- 
uxCsource/trcov.c). trcov includes C-code implementation for all currently available co- 
variance functions. If trcov is not available, gpcf _sexp_trcov uses M-code computation. 

gpcf _sexp_trvar evaluates training variance vector. This is a mandatory subfunction 
used for example in prediction and energy computations. 

gpcf _sexp_cf g evaluates gradient of covariance function with respect to the parameters, 
gpcf _sexp_cf g has four different calling syntaxes. First one is a mandatory used in gradient 
computations. Second and third are needed when using sparse approximations (e.g. FIC). 
Fourth one is needed when using memory save option in gp_set (without memory save 
option, all matrix derivatives with respect to all covariance parameters are computed at 
once, which may take lot of memory if n is large and there are many parameters). 

gpcf _sexp_cf dg evaluates gradient of covariance function, of which has been taken par- 
tial derivative with respect to x, with respect to parameters. This subfunction is needed 
when using derivative observations. 

gpcf _sexp_cf dg2 evaluates gradient of covariance function, of which has been taken 
partial derivative with respect to both input variables x, with respect to parameters. This 
subfunction is needed when using derivative observations. 

gpcf _sexp_ginput evaluates gradient of covariance function with respect to x. This 
subfunction is needed when computing gradients with respect to inducing inputs in sparse 
approximations . 



gpcf _sexp_ginput2 evaluates gradient of covariance function with respect to both input 
variables x and x2. This subfunction is needed when computing gradients with respect to 
both input variables x and x2 (in same dimension). This subfunction is needed when using 
derivative observations. 

gpcf _sexp_ginput3 evaluates gradient of covariance function with respect to both input 
variables x and x2. This subfunction is needed when computing gradients with respect to 
both input variables x and x2 (in different dimensions). This subfunction is needed when 
using derivative observations. 

gpcf _sexp_ginput4 evaluates gradient of covariance function with respect to x. Sim- 
plified and faster version of sexp_ginput, returns full matrices. This subfunction is needed 
when using derivative observations. 

gpcf _sexp_recappend is needed when using MCMC sampling (gp_mc). 

F.3 prior functions 

New priors can be added by copying and modifying one of the existing prior functions. We 
use prior_t as an example. 

inputParser is used to make argument checks and set some default values. If the new 
prior does not have parameters, corresponding lines can be removed (see, e.g., prior_unif). 

ip=inputParser ; 

ip.FunctionName = 'PRI0R_T' ; 

ip. addOptionaK 'p' , [] , Oisstruct) ; 

ip. addParamValue( 'mu' ,0, @(x) isscalar (x) ) ; 

ip. addParamValue( 'mu_prior' , [] , @(x) isstruct(x) || isempty(x)); 

ip. addParamValue( 's2' , 1 , @(x) isscalar (x) kk x>0) ; 

ip. addParamValue( 's2_prior' , [] , @(x) isstruct(x) || isempty (x) ) ; 

ip. addParamValue ( 'nu' ,4, @(x) isscalar(x) kk x>0) ; 

ip. addParamValue( 'nu_prior' , [] , @(x) isstruct(x) || isempty (x) ) ; 

ip . parse (varargin{ : }) ; 

p=ip. Results. p; 

Function handles to the subfunctions provide object-oriented behaviour (at time when 
GPstuff project was started, real classes in Matlab were too inefficient and Octave still lacks 
proper support). If some of the subfunctions have not been implemented, corresponding 
function handles should not be defined. 

if init 

°/ set functions 

p.fh.pak = @prior_t_pak; 

p.fh.unpak = @prior_t_unpak; 

p.fh.lp = @prior_t_lp; 

p.fh.lpg = @prior_t_lpg; 

p . f h. recappend = @prior_t_recappend; 
end 



prior_t_pak and prior_t_unpak functions are used to support generic optimization and 
sampling functions, which assume that parameters are presented as a vector. Parameters 
which have empty prior ([]) are ignored. These subfunctions are mandatory (even if there 
are no parameters). 

prior_t_lp and prior_t_lpg compute the log prior density of the parameters and its 
gradient with respect to parameters. Parameters which have empty prior ([]) are ignored. 
These subfunctions are mandatory. 

prior_t_recappend This subfunction is needed when using MCMC sampling (gp_mc). 

F.4 Optimisation functions 

Adding new optimisation functions is quite easy as covariance function and likelihood pa- 
rameters and inducing inputs can be optimised using any optimisation function using same 
syntax as, for example, fminunc by Mathworks or fminscg and fminlbfgs included in 
GPstuff. 
Syntax is 

X = FMINX (FUN , XO, OPTIONS) 

where FUN accepts input X and returns a scalar function value F and its scalar or vector 
gradient G evaluated at X, XO is initial value and OPTIONS is option structure. 
Then new optimisation algorithm can be used with code 

gp=gp_opt im (gp , x , y , ' opt ' , opt , ' opt imf ' , Of minx) ; 
F.5 Other inference methods 

Adding new Monte Carlo sampling functions is more complicated. gp_mc function is used to 
allow the use of different sampling methods for covariance function parameters, likelihood 
parameters and latent values with one function call. Thus adding new sampling method 
requires modification of gp_mc. Note also the subfunctions gpmc_e and gpmc_g, which return 
energy (negative log posterior density) and its gradient while handling infer _params option 
properly. See gp_mc for further information. 

Adding new analytic approximation, for example, variational Bayes approach, would 
require major effort implementing many inference related functions. To start see functions 
with name starting gpla_ or gpep_ and gp_set. 



